Nonsampling errors in dual frame telephone surveys


J. Michael Brick, Ismael Flores Cervantes, Sunghee Lee and Greg Norman ¹


Abstract 


Dual frame telephone surveys are becoming common in the U.S. because of the incompleteness of the landline frame as 
people transition to cell phones. This article examines nonsampling errors in dual frame telephone surveys. Even though 
nonsampling errors are ignored in much of the dual frame literature, we find that under some conditions substantial biases 
may arise in dual frame telephone surveys due to these errors. We specifically explore biases due to nonresponse and 
measurement error in these telephone surveys. To reduce the bias resulting from these errors, we propose dual frame 
sampling and weighting methods. The compositing factor for combining the estimates from the two frames is shown to play
an important role in reducing nonresponse bias.


Key Words: Nonresponse bias; Measurement error; Calibration; Sample allocation; Composite. 


1. Introduction 


Dual frame telephone surveys that sample from both 
landline and cell phones have become important in the U.S. 
to reduce undercoverage bias due to the incompleteness of 
the landline frame. Blumberg and Luke (2009) show that 
the percentage of households without a landline telephone 
but with at least one cell phone has increased dramatically in 
the last few years, reaching 20 percent by the end of 2008. 
Other countries also report substantial increases in the 
percentages of people who have only a cell phone (e.g., 
Kuusela, Callegaro and Vehovar 2008; Vicente and Reis 
2009). 

This paper uses data from the California Health Interview 
Survey (CHIS) and from 8 surveys conducted for the Pew 
Research Center for the People & the Press to examine the 
effects of nonsampling errors in dual frame telephone 
surveys. The CHIS 2007, a survey of California adults, was 
undertaken in late 2007. It combines a standard landline 
survey with a screening sample of cell phone numbers, 
where adults from the cell sample were interviewed only if 
they indicated that they did not have a landline number in 
the household. The Pew surveys are national surveys that 
interviewed an adult at all sampled residential telephone 
numbers from both landline and the cell samples. These 
surveys are described in more detail later. A number of 
important issues associated with the effect of nonsampling 
errors have been identified as a result of undertaking these 
dual frame telephone surveys — errors that have not been 
investigated fully in other studies. 

In the next section we review sample design, weighting 
and variance estimation methods developed for dual frame 
surveys, and describe CHIS 2007 and Pew dual frame 
telephone surveys that are used throughout the paper. The
third section discusses nonsampling error in dual frame
telephone surveys, and the effects these errors may have on 
the bias of estimates. Nonresponse and measurement errors 
have special importance in dual frame surveys. The fourth 
section studies sampling and estimation methods that may 
be used to alleviate bias in dual frame telephone surveys, 
and gives conditions under which these sampling and estimation
approaches may be most useful. In this section we
propose three estimators to reduce the bias due to differential
nonresponse within the overlap domain. The final
section summarizes some of the findings for dual frame 
telephone surveys, and speculates on the applicability of 
these findings for other dual frame surveys. 


2. Background 


Most of the literature on dual frame surveys deals with 
the statistical theory related to efficiency in sample design 
and estimation. We summarize some of the key results in 
sampling, weighting and variance estimation, and then 
discuss the application of these methods to dual frame 
telephone surveys. 


2.1 Sampling 


The two sampling frames are denoted as A and B, and we
assume the samples from these frames, $S_A$ and $S_B$, are
independent. The domain of units that are only in A is a, the
domain of units only in B is b, and the intersection
containing the overlap units is ab. In our application to
telephone surveys, A is the frame of landline numbers, B is 
the frame of cell phone numbers, a is the domain of 
households with only landline numbers, b is the domain of 
households with only cell phone numbers, and ab is the 
domain of households with both types of telephone service. 


1. J. Michael Brick, Westat and Joint Program in Survey Methodology at the University of Maryland. E-mail: mikebrick@westat.com; 
Ismael Flores Cervantes, Westat; Greg Norman, Westat, 1600 Research Boulevard, Rockville Maryland, 20850 U.S.A.; Sunghee Lee, Institute for Social 
Research, University of Michigan, 426 Thompson St. Ann Arbor, MI 48104, U.S.A. 
Many important features of dual frame surveys depend on
how units that could fall into both sampling frames (ab) are 
handled. 

A screening dual frame approach attempts to make $ab = \emptyset$
by removing any overlap units before sampling, after
sampling but prior to data collection, during data collection,
or after data collection. Lohr (2009) gives examples of dual
frame surveys using each of these approaches.

Brick, Edwards and Lee (2007) and Fleeman (2007) 
describe screening in dual frame telephone surveys. While 
U.S. telephone numbers can be partitioned by whether they 
are cell or landline numbers, this frame does not identify 
whether those numbers correspond to households with only 
landlines (a), households with only cell phones (b), or 
households with both types of service (ab). In the surveys 
described by Brick, Edwards and Lee (2007) and Fleeman 
(2007), households sampled from the cell phone frame (B) 
were screened out during the data collection if they reported 
having a landline. The CHIS 2007 used this screening 
approach. 

A second approach is called an overlap dual frame 
survey, and units in the overlap could be sampled from both 
frames. In this case, estimation methods must be employed 
to avoid biased estimates because the overlap units have 
multiple chances of selection. Steeh (2004), Brick, Brick, 
Dipko, Presser, Tucker and Yuan (2007), and Kennedy 
(2007) discuss dual frame telephone surveys with overlap. 
In these cases, all respondents are interviewed irrespective 
of the frame they are sampled from. The Pew surveys use 
the overlap approach. 


2.2 Estimation


In a screening survey, producing weights for estimating
totals and characteristics of the entire population is simple,
at least in the absence of nonsampling errors. Since $ab = \emptyset$
and the sampling is independent, the units sampled from
each frame are assigned weights that are the inverse of their
selection probabilities from the frame from which they were
selected. An overall estimate of the total is the sum of the
weighted domain estimates, $\hat{y}_{scr} = \hat{y}_A + \hat{y}_b$, where
$\hat{y}_A = \sum_{i \in S_A} d_i y_i$ and $\hat{y}_b = \sum_{i \in S_B} d_i \delta_i(b) y_i$, where $d_i$ is the
inverse of the selection probability and $\delta_i(b) = 1$ if $i$ is in
domain b and 0 otherwise. Variance estimation is also
straightforward since the two frames are strata and variance
estimation methods appropriate for stratified samples can be
applied. For telephone surveys, the landline sample units are
weighted and added to the weighted cell phone sampled
units, after the sampled cell phone units that have landlines
are given a weight of zero.
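
The weighting rule just described can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the `SampledUnit` record, its field names, and the toy selection probabilities are hypothetical and are not taken from CHIS. Each unit receives the inverse of its selection probability as a weight, and cell-frame units that report a landline keep their place in the sample but contribute a weight of zero.

```python
from dataclasses import dataclass


@dataclass
class SampledUnit:
    frame: str          # "A" = landline frame, "B" = cell phone frame
    sel_prob: float     # selection probability within the unit's own frame
    has_landline: bool  # telephone status reported at screening
    y: float            # survey variable of interest


def screening_total(units):
    """Screening estimate of a total: sum of d_i * delta_i * y_i over both samples."""
    total = 0.0
    for u in units:
        d = 1.0 / u.sel_prob  # base weight: inverse selection probability
        # Landline-frame units always count; cell-frame units count only if cell-only.
        in_scope = u.frame == "A" or not u.has_landline
        total += d * u.y if in_scope else 0.0
    return total


# Hypothetical toy sample: two landline units, one cell-only unit, one screened-out unit.
sample = [
    SampledUnit("A", 0.01, True, 1.0),
    SampledUnit("A", 0.01, True, 0.0),
    SampledUnit("B", 0.02, False, 1.0),
    SampledUnit("B", 0.02, True, 1.0),  # has a landline: weight of zero
]
print(screening_total(sample))
```

Keeping the screened-out unit in the list, rather than dropping it, mirrors the point that such units remain sampled units for variance estimation.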

Screening during data collection, even in the absence of
nonsampling errors, does have implications. For example,
screened out households from B are not eligible for the
interview, and this increases data collection costs and the
variance of estimated totals (Kish 1965, Chapter 11). The
units that are screened out should also be treated properly as
sampled units in variance estimation.

Statistics Canada, Catalogue No. 12-001-X

Overlap surveys are more complex because units could
be sampled from either of the frames. One estimation approach
is to combine the two domain estimates, $\hat{y}_a$ and $\hat{y}_b$,
with an average of the estimates of the overlap population
from the separate frames. If $\hat{y}^A_{ab}$ and $\hat{y}^B_{ab}$ are the weighted
estimates of the overlap domain from frame A and frame B,
respectively, then an average or composite estimator is
$\hat{y}_{ave} = \hat{y}_a + \hat{y}_b + \lambda \hat{y}^A_{ab} + (1 - \lambda) \hat{y}^B_{ab}$, with $0 \le \lambda \le 1$. Following
Lohr (2009) we refer to these as average estimators.
Assuming $\hat{y}_a$ and $\hat{y}_b$ are unbiased for domain a and
domain b, and $\hat{y}^A_{ab}$ and $\hat{y}^B_{ab}$ are both unbiased for domain
ab, then $\hat{y}_{ave}$ is an unbiased estimator of the total. Estimates
of means and other quantities can be produced using
weights, where the weights for units in ab that are sampled
from A are multiplied by $\lambda$ and the weights for overlap
units sampled from B are multiplied by $(1 - \lambda)$. The choice
of the compositing factor, $\lambda$, has been investigated by
many researchers and specific choices to reduce the variance
of the estimates have been suggested by Hartley (1962,
1974) and Fuller and Burmeister (1972). All of the average
estimators require that the domain for all sampled units can
be identified.
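
As a minimal sketch of the weight adjustment described above, the following Python fragment scales the base weights of overlap units by the compositing factor (written `lam`) for frame A and by one minus that factor for frame B, leaving the weights of non-overlap units unchanged. The tuple layout `(base_weight, frame, domain, y)` and the toy values are assumptions made for illustration only.

```python
def composite_weight(base_weight, frame, domain, lam):
    """Return the composited weight for one respondent.

    frame  : "A" (landline) or "B" (cell)
    domain : "a", "b", or "ab" (must be known for every sampled unit)
    lam    : fixed compositing factor between 0 and 1
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("compositing factor must lie in [0, 1]")
    if domain != "ab":
        return base_weight  # non-overlap units keep their base weight
    return lam * base_weight if frame == "A" else (1.0 - lam) * base_weight


def average_estimator_total(records, lam):
    """Average (composite) estimate of a total from (weight, frame, domain, y) tuples."""
    return sum(composite_weight(w, f, dom, lam) * y for w, f, dom, y in records)


# Hypothetical toy data: one unit per frame-by-domain combination.
records = [
    (100.0, "A", "a", 1.0),
    (100.0, "A", "ab", 1.0),
    (50.0, "B", "ab", 1.0),
    (50.0, "B", "b", 1.0),
]
print(average_estimator_total(records, lam=0.5))
```

Setting `lam=1.0` gives overlap units from frame B a weight of zero, which is the screening case.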

Variance estimation with the average estimator is relatively
simple if $\lambda$ is fixed and not dependent on the
selected sample. In this case, $V(\hat{y}_{ave}) = V(\hat{y}_a + \lambda \hat{y}^A_{ab}) + V(\hat{y}_b + (1 - \lambda) \hat{y}^B_{ab})$,
and each of these variances can be
computed using variance estimation methods appropriate
for the separate samples. If $\lambda$ is sample dependent, as with
the Hartley and Fuller and Burmeister estimators, then variance
estimation is more complicated. The average estimators
with a fixed $\lambda$ have been used in most dual frame
telephone surveys with overlap. This approach is discussed
below for the Pew surveys.

Other estimation approaches that have been considered 
for an overlap survey include the single frame estimator 
(Bankier 1986; Kalton and Anderson 1986; and Skinner 
1991), and the pseudo-maximum-likelihood estimator 
(Skinner and Rao 1996; Lohr and Rao 2000; and Lohr and 
Rao 2006). Lohr (2009) reviews these estimators. Nearly all 
telephone surveys with overlap that we have seen use some 
versions of the average estimator, and it is the focus of this 
research. 


2.3 Telephone survey applications 


Data from CHIS 2007 are used to illustrate issues that
arise in dual frame telephone surveys that use a screening
approach. The CHIS 2007 is a telephone survey of
California’s population conducted by the UCLA Center for
Health Policy Research in collaboration with the California 
Department of Public Health, the California Department of 
Health Care Services, and the Public Health Institute. Data 
collection for CHIS 2007 was carried out by Westat in late 
2007 through early 2008. 

In the CHIS 2007 landline sample, one adult was sampled
and interviewed in each household. In the cell phone
sample, persons living in households with landline phones
were screened out; an adult was sampled and interviewed in
the cell sample if they lived in a household classified as cell-only.
All responding households, including those screened
out from the cell phone frame, were asked questions about
telephone status and usage. Nearly 49,000 adult interviews
were completed from the landline sample, and 825 interviews
were completed with cell-only adults. The landline
sample response rate was 35.5% for the interview conducted
with a household informant, and 59.4% for the sampled
adult. The respective response rates for the sample from the cell
frame were 22.1% and 52.0%. Since CHIS 2007 used a
screening approach, the reported response rate for the cell-only
household informant interview is 30.5%. California
Health Interview Survey (2009) discusses details of the
study design, including differences between the overall cell
phone response rate and the cell-only rate.

In the CHIS 2007, the estimates from the cell phone 
sample are calibrated to the cell-only adult population in 
California at the screening stage (prior to nonresponse 
weight adjustment for the sampled adult). There are some 
difficulties with obtaining reliable control totals for the 
calibration at the state level that are discussed later. The two 
samples from the two frames are independent samples and 
are treated as such, until the ultimate stage where the two 
are combined and calibrated to independent totals of the 
entire adult population of California. This last calibration 
stage does not include telephone status as a domain. 

For dual frame telephone surveys with overlap, we use 
data aggregated from 8 surveys conducted for the Pew 
Research Center for the People & the Press in late 2008 
through early 2009. (The data for the Pew surveys were 
provided by Scott Keeter of the Pew Research Center for the 
People & the Press). All of these are surveys of the entire 
U.S. adult population. The surveys interview one adult in 
each sampled household from both frames using nearly 
identical questionnaires. Over the 8 surveys, nearly 11,300 
landline interviews and 3,800 cell phone interviews were 
completed. The response rates from the different surveys are 
very similar for the landline and the cell phone samples, 
with a median difference of one percentage point between 
the samples from the two frames. The response rates range 
across the 8 surveys and two frames from 17% to 24%. 




In the Pew surveys, like most dual frame telephone 
surveys with overlap, a calibrated version of the average 
estimator is employed. Most surveys calibrate to both the
telephone status domain counts (the number of adults living in
households with only cell phones, the number in households
with only landlines, and the number in households with both
landlines and cell phones), and to demographic variables. The Pew studies
are also calibrated to demographic totals including age,
education, race/ethnicity, region, and population density of
households with adults 18 years of age or older. In addition,
they calibrate to totals of telephone status and, within the
overlap domain, to relative usage of landline and cell
phones.


3. Nonsampling errors 


Dual frame theory has been developed for ideal conditions:
complete response and the absence of other nonsampling
errors. Nonsampling errors affect the bias and precision
of the estimates in any survey, but their effects in dual
frame surveys may be qualitatively different from those in 
single frame surveys for three reasons. First, nonsampling 
error in dual frame surveys often makes it difficult to 
determine the probability of selection of the sampled unit. 
This occurs when domain membership is ascertained during 
data collection, and nonresponse and measurement errors 
make it difficult to determine if a sampled unit is in the 
overlap. Second, nonsampling error in dual frame surveys 
may be linked directly, sometimes causally, to the sampling 
frame especially when data collection approaches differ by 
frame. Third, sampling from more than one frame adds 
complexity and creates more opportunities for nonsampling 
errors to have differential effects. 


3.1 Nonresponse effects 


Brick, Dipko, Presser, Tucker and Yuan (2006) show
that the over-representation of the number of adults in cell-only
households that occurs in almost all dual frame telephone
samples may be due to nonresponse error. They
suggest that this over-representation might be the result of
differential accessibility: adults who rarely use cell phones
are less likely to answer their cell phone than those who use
their cell phones regularly. They did not find the same type
of usage-related differential response rates in the landline
sample. Kennedy (2007) further explores this type of
nonresponse bias by examining the effects on specific
estimates.

To evaluate the differential representation, we compare 
the CHIS 2007 and Pew survey sample distributions by 
sampling frame and telephone usage to estimates from the 
National Health Interview Survey (NHIS). The NHIS is a 
face-to-face survey sponsored by the National Center for
Health Statistics with data collected by the U.S. Bureau of the
Census (the NHIS data were provided by S. Blumberg and
J. Luke as a special tabulation). It is the only federal government
survey that provides estimates of telephone status
and usage (Blumberg and Luke 2009). We define usage for
the dual users (those in households with both types of phone 
service) as cell-mainly and land-mainly, where cell-mainly 
are persons who live in households that receive all or almost 
all their calls on their cell phone and land-mainly are the 
dual users in households that do not receive all or almost all 
their calls on their cell phone. 

To be more comparable to the CHIS figures, Table 1 
restricts the NHIS estimates to those from the West region 
only (NHIS estimates for California are not available). 
California accounts for 52 percent of the adults in the West. 
The NHIS figures are population estimates from the first six 
months of 2008, which is roughly contemporaneous to the 
CHIS data collection period. The CHIS figures are the unweighted
sample dispositions (the weighted dispositions are
nearly identical). Even though CHIS used a screening
approach, the telephone usage information was collected for
every responding household in the cell phone sample. The
table shows that the cell phone frame distribution over-represents
the percent of adults in cell-only households and
under-represents land-mainly adults when compared to the
NHIS estimates. The landline respondents over-represent
the land-only users and under-represent the cell-mainly dual
users. The landline frame differences are more substantial
than observed in a 2004 survey as reported in Brick et al.
(2006).

Table 2 shows the same type of comparison of the NHIS
national estimates from the second half of 2008 to the
aggregated Pew survey unweighted outcomes (all the surveys
were equal probability samples). Similar to the CHIS
results, the cell frame distribution from the Pew surveys
over-represents the percentage in the cell-only group and
under-represents the land-mainly group, but the differences
are less substantial than in CHIS. The Pew distribution from
the landline sample mirrors the NHIS distribution closely,
with a slight under-representation of the cell-mainly group.


Table 1
Percentage distribution of adults from CHIS 2007 and NHIS, by telephone usage

                     NHIS West adults in   CHIS 2007 landline   NHIS West adults in     CHIS 2007 cell phone
Telephone usage      landline households   distribution         cell phone households   distribution
Landline-only        23.5% (1.5%)          34.2% (0.2%)         -                       -
Dual, land-mainly    56.6% (1.7%)          53.2% (0.2%)         60.9% (1.7%)            18.5% (0.7%)
Dual, cell-mainly    19.9% (1.4%)          12.7% (0.2%)         21.4% (1.4%)            31.2% (0.9%)
Cell-only            -                     -                    17.7% (1.3%)            50.3% (0.9%)
Total                100.0%                100.0%               100.0%                  100.0%

Notes: NHIS-West is the National Health Interview Survey, West Region, first 6 months of 2008, with percentages of all households with that type
of service (thanks to S. Blumberg and J. Luke for this special tabulation). CHIS 2007 is the California Health Interview Survey, collected in
2007 and early 2008, with unweighted percentages from the landline and cell frames. In the cell phone sample, usage was obtained in the
screening interview. Approximate standard errors given in ().


Table 2
Percentage distribution of adults from Pew surveys and NHIS, by telephone usage

                     NHIS adults in        Pew surveys landline   NHIS adults in cell   Pew surveys cell
Telephone usage      landline households   distribution           phone households      phone distribution
Landline-only        19.4% (0.7%)          23.0% (0.4%)           -                     -
Dual, land-mainly    58.8% (0.8%)          62.7% (0.5%)           58.8% (0.8%)          42.3% (0.8%)
Dual, cell-mainly    19.3% (0.7%)          14.4% (0.3%)           18.5% (0.7%)          24.0% (0.7%)
Cell-only            -                     -                      22.7% (0.7%)          33.7% (0.8%)
Total                100.0%                100.0%                 100.0%                100.0%

Notes: NHIS is the National Health Interview Survey, second 6 months of 2008, with percentages of all households with that type of
service. Pew surveys aggregates 8 surveys conducted for the Pew Research Center for the People & the Press from October 2008
through March 2009, with unweighted percentages from the landline and cell frames. (Thanks to S. Keeter for providing these data).
Approximate standard errors given in ().
Both of these surveys exhibit response distributions by
frame and usage that are consistent with the accessibility
conjecture of Brick et al. (2006). This conjecture implies an
ordering of those that are most accessible and likely to
respond: in the cell frame, the ordering from the most likely
to the least likely to respond is cell-only, cell-mainly,
and land-mainly. The special problem due to having
two frames is that the ordering in the landline frame is
different (land-only, land-mainly, cell-mainly), and the
overlap units from the two frames could have very different
response rates and biases.

To examine nonresponse bias for a dual frame survey
with overlap, suppose both the landline and cell samples are
poststratified to telephone status domain totals prior to
forming an average overall estimate. The poststratified estimator
is

$$\hat{y}_{ps} = g_a \hat{y}_a + g_b \hat{y}_b + \lambda g^A_{ab} \hat{y}^A_{ab} + (1 - \lambda) g^B_{ab} \hat{y}^B_{ab}, \qquad (1)$$

where the poststratification factor for the land-only sample
is $g_a = N_a / \hat{N}_a$, for the cell-only sample it is $g_b = N_b / \hat{N}_b$, and the
frame specific poststratification factors for the overlap are
$g^A_{ab} = N_{ab} / \hat{N}^A_{ab}$ and $g^B_{ab} = N_{ab} / \hat{N}^B_{ab}$ for the landline and
cell samples, respectively. The Horvitz-Thompson (HT)
estimators of the number of units are $\hat{N}_a$ for the land-only
domain, $\hat{N}_b$ for the cell-only domain, and $\hat{N}^A_{ab}$ and $\hat{N}^B_{ab}$
for the overlap domain from the two samples. Since we
focus on the overlap, we write

$$\hat{y}_{ps,ab} = \lambda g^A_{ab} \hat{y}^A_{ab} + (1 - \lambda) g^B_{ab} \hat{y}^B_{ab}. \qquad (2)$$

This poststratified estimator differs from the approach
suggested by Lohr and Rao (2000), who average and then
poststratify rather than poststratify and then average. Both
approaches are consistent and approximately unbiased when
there are no nonsampling errors.
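
The "poststratify and then average" order for the overlap domain can be illustrated with a short Python sketch. It assumes that each frame's overlap respondents are available as hypothetical `(weight, y)` pairs and that the control total for the overlap domain is known; the function name and the toy values are illustrative only.

```python
def poststratified_overlap_total(control_total_ab, overlap_a, overlap_b, lam):
    """Poststratify each frame's overlap estimate, then average with factor lam.

    control_total_ab : known number of units in the overlap domain
    overlap_a        : (weight, y) pairs for overlap respondents from the landline sample
    overlap_b        : (weight, y) pairs for overlap respondents from the cell sample
    lam              : fixed compositing factor between 0 and 1
    """
    def ps_total(sample):
        n_hat = sum(w for w, _ in sample)      # HT estimate of the overlap count
        y_hat = sum(w * y for w, y in sample)  # HT estimate of the overlap total
        g = control_total_ab / n_hat           # frame-specific poststratification factor
        return g * y_hat

    return lam * ps_total(overlap_a) + (1.0 - lam) * ps_total(overlap_b)


# Hypothetical toy data: both samples are scaled to the same control total of 1,000.
landline_overlap = [(100.0, 1.0), (100.0, 0.0)]
cell_overlap = [(50.0, 1.0), (150.0, 0.0)]
print(poststratified_overlap_total(1000.0, landline_overlap, cell_overlap, lam=0.5))
```

Because each frame is scaled to the control total before averaging, the two poststratified totals can still disagree when response differs by usage within the overlap, and the compositing factor then determines how much of each frame's error enters the final estimate.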

If we allow for differential response rates by telephone
usage within the overlap such as those observed in dual
frame telephone surveys, (2) is biased. Let $W$ be the
proportion of the overlap that are land-mainly, and let $\bar{Y}_{ml}$
and $\bar{Y}_{mc}$ be the population means for a characteristic for
land-mainly and cell-mainly dual users, respectively. The
bias of $\hat{y}_{ps,ab}$ is

$$\mathrm{bias}(\hat{y}_{ps,ab}) = W N_{ab} (\bar{Y}_{ml} - \bar{Y}_{mc}) \{ \lambda r_{l,ml} r_l^{-1} + (1 - \lambda) r_{c,ml} r_c^{-1} - 1 \}, \qquad (3)$$

where $r_l$ is the dual user’s response rate for the landline
sample, $r_{l,ml}$ is the landline sample response rate of the land-mainly,
$r_c$ is the dual user’s response rate for the cell
sample, and $r_{c,ml}$ is the cell phone sample response rate of the
land-mainly.




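To derive (3), we first define land-mainly and cell-mainly
domain estimators from the landline sample as $\hat{y}^A_{ab}(ml) = \hat{N}^A_{ml} \bar{y}^A(ml)$
and $\hat{y}^A_{ab}(mc) = \hat{N}^A_{mc} \bar{y}^A(mc)$, and from the
cell sample as $\hat{y}^B_{ab}(ml) = \hat{N}^B_{ml} \bar{y}^B(ml)$ and $\hat{y}^B_{ab}(mc) = \hat{N}^B_{mc} \bar{y}^B(mc)$.
Now assume (a) $E \bar{y}^A(ml) = E \bar{y}^B(ml) = \bar{Y}_{ml}$
and $E \bar{y}^A(mc) = E \bar{y}^B(mc) = \bar{Y}_{mc}$; (b) covariances
such as $\mathrm{cov}(\hat{N}^A_{ml} / \hat{N}^A_{ab}, \bar{y}^A(ml)) = 0$; and, (c) the expected
domain totals are simple expressions such as $E \hat{N}^A_{ab} = r_l N_{ab}$,
$E \hat{N}^A_{ml} = r_{l,ml} N_{ml}$, etc. Since
$E (N_{ab} / \hat{N}^A_{ab}) \hat{y}^A_{ab} = N_{ab} E \{ (\hat{N}^A_{ml} \bar{y}^A(ml) + \hat{N}^A_{mc} \bar{y}^A(mc)) / \hat{N}^A_{ab} \}$, we can write
$E (N_{ab} / \hat{N}^A_{ab}) \hat{y}^A_{ab} = r_l^{-1} r_{l,ml} N_{ml} \bar{Y}_{ml} + r_l^{-1} r_{l,mc} N_{mc} \bar{Y}_{mc} = N_{ab} \{ r_{l,ml} r_l^{-1} W (\bar{Y}_{ml} - \bar{Y}_{mc}) + \bar{Y}_{mc} \}$,
where the last step uses $N_{ml} = W N_{ab}$, $N_{mc} = (1 - W) N_{ab}$ and $r_l = W r_{l,ml} + (1 - W) r_{l,mc}$.
A corresponding expression can be
written for $E g^B_{ab} \hat{y}^B_{ab}$. Combining the two gives (3).

These expressions assume that, in each sample, $E \bar{y}(ml) = \bar{Y}_{ml}$ and
$E \bar{y}(mc) = \bar{Y}_{mc}$. An alternative approach that does not
require this assumption is to posit that there is response
propensity associated with telephone usage. The bias in this
case would be a function of the response propensities from
each frame. We do not examine the response propensity
approach here.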

Expression (3) shows that when $0 < W < 1$, the bias of
$\hat{y}_{ps,ab}$ is zero if (a) $\bar{Y}_{ml} = \bar{Y}_{mc}$; or (b) $\lambda r_{l,ml} r_l^{-1} + (1 - \lambda) r_{c,ml} r_c^{-1} = 1$.
Condition (a) is basically the well-known condition
from single frame methodology. Condition (b) differs
from single frame expressions because the bias depends on
both the relative response rates and the compositing factor,
$\lambda$. The exception is when $r_{l,ml} r_l^{-1} = r_{c,ml} r_c^{-1}$, or equivalently
$r_{l,ml} r_{l,mc}^{-1} = r_{c,ml} r_{c,mc}^{-1}$, where $r_{l,mc}$ is the landline sample response
rate of the cell-mainly and $r_{c,mc}$ is the cell sample response
rate of the cell-mainly. In this form, this expression is
comparable to the single frame bias expression that shows
no bias exists when response rates are constant.

More generally, the value of $\lambda$ affects the bias of the
estimate, not just its variance. The bias can be eliminated by
choosing

$$\lambda = \frac{r_l (r_c - r_{c,ml})}{r_{l,ml} r_c - r_l r_{c,ml}}. \qquad (4)$$

Since the proportion of the total population covered by the
landline frame is approximately equal to the proportion
covered by the cell phone frame, most applications have
used $\lambda = 0.50$ without considering its effect on bias.

We can now apply these expressions to evaluate the bias
of the dual frame telephone estimator for CHIS, assuming the
bias is only from differential nonresponse in the overlap.
Using the data in Table 1, $W = 0.74$ for the NHIS West
region. We approximate $r_{l,ml} r_l^{-1}$ by the relative poststratification
factor, that is, the ratio of the percentage of the CHIS
landline sample classified as land-mainly to the percentage
of the NHIS adults in landline households that are land-mainly,
with both percentages computed among dual users;
$r_{c,ml} r_c^{-1}$ is computed similarly for the cell phone
quantities. The quantities estimated from CHIS 2007 are
given in Table 3: $r_{l,ml} r_l^{-1} = 1.09$ for the landline sample, and
$r_{c,ml} r_c^{-1} = 0.50$ for the cell sample. As an example, suppose
$\bar{Y}_{ml} = 0.3$ and $\bar{Y}_{mc} = 0.5$; then the bias of the estimated
percentage based on (3) is approximately 3 percentage
points (a relative bias of about 9%) if $\lambda = 0.5$. Using (4), the
bias is zero when $\lambda = 0.84$; the bias becomes negative for
larger values of $\lambda$.
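
The arithmetic in this example can be checked with a small Python sketch. It assumes that expression (3), divided by the overlap population size, takes the form W (Ybar_ml - Ybar_mc) {lam * a + (1 - lam) * b - 1}, where a and b are the relative response rates of the land-mainly in the landline and cell samples, and that (4) is the value of lam that sets the braces to zero, namely (1 - b) / (a - b). The function and argument names are illustrative, not the authors'.

```python
def overlap_bias_per_unit(w, ybar_ml, ybar_mc, rel_land, rel_cell, lam):
    """Bias of the overlap estimate per overlap unit under the assumed form of (3).

    w        : proportion of the overlap that is land-mainly
    rel_land : land-mainly response rate relative to all dual users, landline sample
    rel_cell : the same ratio for the cell sample
    lam      : compositing factor applied to the landline sample
    """
    weighted_rel = lam * rel_land + (1.0 - lam) * rel_cell
    return w * (ybar_ml - ybar_mc) * (weighted_rel - 1.0)


def zero_bias_lambda(rel_land, rel_cell):
    """Compositing factor that removes the bias, assuming the form of (4) stated above."""
    if rel_land == rel_cell:
        raise ValueError("bias does not depend on lam when the two ratios are equal")
    return (1.0 - rel_cell) / (rel_land - rel_cell)


# CHIS 2007 inputs quoted in the text: W = 0.74, ratios 1.09 and 0.50.
bias = overlap_bias_per_unit(0.74, 0.3, 0.5, 1.09, 0.50, lam=0.5)
true_mean = 0.74 * 0.3 + (1 - 0.74) * 0.5
print(f"bias: {100 * bias:.1f} points, relative bias: {100 * bias / true_mean:.0f}%")
print(f"zero-bias compositing factor: {zero_bias_lambda(1.09, 0.50):.3f}")
```

With the rounded inputs the zero-bias factor comes out near 0.85, in line with the value of about 0.84 quoted above; the small gap is presumably due to rounding of the Table 3 ratios.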


Table 3
Within overlap, relative poststratification factors for CHIS 2007 and Pew surveys

Relative poststratification factors*                  CHIS 2007   Pew surveys
$r_{l,ml} r_l^{-1}$ (landline sample, land-mainly)    1.09        1.07
$r_{c,ml} r_c^{-1}$ (cell sample, land-mainly)        0.50        0.84
$r_{l,mc} r_l^{-1}$ (landline sample, cell-mainly)    0.74        0.78
$r_{c,mc} r_c^{-1}$ (cell sample, cell-mainly)        2.42        1.51

* Poststratification adjustment factor for telephone usage domain
within overlap divided by overlap poststratification factor.
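
As a rough check, the CHIS 2007 column of Table 3 can be approximately reproduced from the Table 1 percentages. The sketch below assumes that each relative factor is the usage group's share of dual users in the survey sample divided by its share of dual users in the NHIS. The CHIS cell-sample percentage for cell-mainly adults is taken as 100 - 18.5 - 50.3 = 31.2, which is inferred from the column total rather than read directly from the table.

```python
def relative_factor(sample_group, sample_other, pop_group, pop_other):
    """Share of a usage group among dual users in the sample, relative to the NHIS share."""
    sample_share = sample_group / (sample_group + sample_other)
    pop_share = pop_group / (pop_group + pop_other)
    return sample_share / pop_share


# Table 1 percentages for dual users: (land-mainly, cell-mainly).
chis_landline = (53.2, 12.7)
nhis_landline = (56.6, 19.9)
chis_cell = (18.5, 100.0 - 18.5 - 50.3)  # cell-mainly inferred from the 100% total
nhis_cell = (60.9, 21.4)

factors = {
    "landline sample, land-mainly": relative_factor(
        chis_landline[0], chis_landline[1], nhis_landline[0], nhis_landline[1]),
    "cell sample, land-mainly": relative_factor(
        chis_cell[0], chis_cell[1], nhis_cell[0], nhis_cell[1]),
    "landline sample, cell-mainly": relative_factor(
        chis_landline[1], chis_landline[0], nhis_landline[1], nhis_landline[0]),
    "cell sample, cell-mainly": relative_factor(
        chis_cell[1], chis_cell[0], nhis_cell[1], nhis_cell[0]),
}
for label, value in factors.items():
    print(f"{label}: {value:.2f}")
```

The first three values agree with the table to two decimals; the last differs in the second decimal, presumably because the published percentages are rounded.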


The same computations can be done using the data from
the Pew surveys, and the estimates are also shown in Table
3. The parameters differ substantially from those computed
from CHIS. Since the Pew studies are national, the NHIS
estimate is $W = 0.81$. The ratios of the Pew figures to the
NHIS also have lower variability than those from the CHIS,
with $r_{l,ml} r_l^{-1} = 1.07$ and $r_{c,ml} r_c^{-1} = 0.84$. As a result, the bias is
only approximately 1 percentage point when $\lambda = 0.5$. The
bias is zero when $\lambda = 0.7$.

To evaluate the biases more completely, estimates of
$\bar{Y}_{ml} - \bar{Y}_{mc}$ are needed for characteristics from a dual frame
telephone survey rather than making arbitrary assumptions
as done in the example above. Blumberg and Luke (2009)
give estimates that suggest these differences may be as
substantial as the differences between the cell-only and
landline population that have been documented extensively
elsewhere. However, the NHIS estimates are from a face-to-face
survey, not a dual frame telephone survey.

Keeter, Dimock and Christian (2008) give estimated
characteristics for dual telephone users by sampling frame,
but not in sufficient detail to compute the biases. Their
estimates indicate that the estimates for dual users from the cell
frame might be closer to the NHIS overlap estimates than
those from the landline frame. However, since the response
rates within the overlap vary more by telephone usage in the cell
frame than in the landline frame, a screening design that
aims to reduce bias should exclude dual users from the cell
phone frame rather than from the landline frame.
Because of the potential bias in the overlap design, Brick
et al. (2006) suggest using a screening design that excludes
adults in dual usage households if they were sampled from
the cell frame. In a screening design, a bias still exists due to
the differential nonresponse in the landline sample of dual
users by telephone usage. Substituting $\lambda = 1$ into (2) and
(3), the bias of $\hat{y}_{scr,ab} = g^A_{ab} \hat{y}^A_{ab}$ is

$$\mathrm{bias}(\hat{y}_{scr,ab}) = W N_{ab} (\bar{Y}_{ml} - \bar{Y}_{mc}) (r_{l,ml} r_l^{-1} - 1). \qquad (5)$$

The bias for this design and estimator is equivalent to that of single
frame estimators, with the bias vanishing when either $\bar{Y}_{ml} = \bar{Y}_{mc}$
or the landline response rates are the same for the land-mainly
and the cell-mainly. Notice that in this design, there
is no compositing factor that can be used to control the bias.

The bias of the screener estimator for CHIS 2007 is about
half that of the average estimator using $\lambda = 0.50$ (the
screener bias is 1.3 percentage points compared to the poststratified
average estimator using $\lambda = 0.50$ with bias of -3.3
points). With the Pew parameters, the bias of the poststratified
average estimator and the screener estimator are
nearly equal, with the bias of the screener slightly greater
than the poststratified estimator (the screener bias is 1.1
percentage points compared to -0.7 points for the poststratified
overlap).
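
The comparison between the screener and the average estimator can be sketched numerically by treating the screener as the special case in which the compositing factor equals one. The sketch assumes the per-unit bias form W (Ybar_ml - Ybar_mc) {lam * a + (1 - lam) * b - 1}, the illustrative means 0.3 and 0.5 used earlier, and the rounded ratios from Table 3. It reports absolute values only, since the sign depends on which group mean is taken to be larger.

```python
def overlap_bias_per_unit(w, ybar_ml, ybar_mc, rel_land, rel_cell, lam):
    """Per-unit bias of the overlap estimate; lam = 1 corresponds to the screener."""
    weighted_rel = lam * rel_land + (1.0 - lam) * rel_cell
    return w * (ybar_ml - ybar_mc) * (weighted_rel - 1.0)


# (W, landline ratio, cell ratio) as quoted in the text and Table 3.
surveys = {"CHIS 2007": (0.74, 1.09, 0.50), "Pew": (0.81, 1.07, 0.84)}

results = {}
for name, (w, rel_land, rel_cell) in surveys.items():
    average = overlap_bias_per_unit(w, 0.3, 0.5, rel_land, rel_cell, lam=0.5)
    screener = overlap_bias_per_unit(w, 0.3, 0.5, rel_land, rel_cell, lam=1.0)
    results[name] = (abs(average), abs(screener))
    print(f"{name}: average {100 * abs(average):.1f} points, "
          f"screener {100 * abs(screener):.1f} points")
```

The screener magnitudes and the Pew average are close to the figures quoted above. With these rounded inputs the CHIS average comes out nearer 3.0 than 3.3 points, so the sketch should be read as an approximation rather than a reproduction of the authors' calculation.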

An issue mentioned earlier is that domain totals for 
poststratification, even for telephone status alone (land-only, 
cell-only, and dual domains), are not generally available for 
state or local area surveys. While small area estimates of the 
percentage of adults who are cell-only at the state level have 
been published (Blumberg, Luke, Davidson, Davern, Yu 
and Soderberg 2009), these do not give small area estimates 
for all three domains. The situation for telephone usage 
control totals is even more limited, with only national NHIS 
estimates published. Since the response rates in the cell 
frame typically vary by usage, some assumptions about the 
response rates in the cell sample may be useful to avoid 
substantial over-representation of cell-only and cell-mainly 
adults from the cell frame sample when using the overlap 
design. 


3.2 Measurement error effects 


In addition to nonresponse, some of the differences in the 
distributions shown in Tables 1 and 2 could be due to
measurement error. Before we discuss hypotheses related to 
measurement error, some of the key procedures in the 
surveys that could be related to measurement error are 
discussed. There are fundamental differences in the surveys, 
such as mode and topic. The NHIS is a face-to-face survey; 
the CHIS and Pew surveys are telephone surveys. Both 
NHIS and CHIS are health surveys, while the Pew surveys 
cover a broad range of topics. 
The surveys also use different methods for collecting
telephone status and usage. In the NHIS an adult family 
member is asked to answer questions about telephone status 
and usage for the entire family in a section of the interview 
about family characteristics. In the cell phone sample in 
CHIS 2007, the telephone status items are asked during the 
household screening, but the usage items are in the sampled 
adult interview. In the CHIS landline sample and the Pew 
surveys, the status and usage items are all in one of the last 
sections of the adult interview. This later placement is 
possible because no screening is involved. 

The sampling of an adult is another procedure that may 
interact with the measurement process. In the CHIS 2007, 
an adult is sampled from all adults who share the same cell 
phone. In the Pew surveys, and most other cell phone 
surveys, the cell phone is considered a personal device, and 
the person answering the phone is interviewed. In dual use 
households, the CHIS and Pew methods may result in 
different samples of adults. 

The greatest potential source of measurement error may 
be related to differences in the questionnaire items for 
telephone status and usage in the surveys. The items asked 
in each survey are given in the appendix. The approaches 
are quite varied. At least part of the difference in the studies 
is because the CHIS and Pew surveys are conducted by 
telephone and have prior information about telephone status. 

The items used in all three surveys are derived from 
items used in a supplement to the Current Population 
Survey (CPS) in 2004. As discussed in Tucker, Brick and 
Meekins (2007), cognitive testing and behavioral coding for 
the supplement identified a number of concerns with the 
CPS items, especially the usage item. Their testing found 
that a lack of a specific reference period, not having a code 
for “half the time,” and difficulty in reporting for other 
members of the household made the usage item susceptible 
to measurement error. Tucker et al. (2007) also highlight the 
difficulty respondents had in reporting telephone status and 
usage for all household members in a single item. In 
addition, respondents had difficulty with understanding the 
meaning of “landline,” “regular,” a “working” cell phone, 
and the difference between using and answering a cell 
phone. 

These issues could affect domain classification, and thus 
bias estimates. For example, a 23-year-old living with 
parents might report being cell-only, while the parents might 
report dual usage. The effects on the estimates of these types 
of measurement errors in the NHIS and telephone surveys 
are difficult to predict, but inconsistent reporting in tele- 
phone and face-to-face administrations is not unexpected. 

Another possible measurement problem is the relationship 
between reporting telephone usage and the sampling frame 
from which respondents were selected. The hypothesized 
error arises if the respondent, when asked which device they 
use to receive most of their calls, is more likely to choose 
the device they are using to do the interview. We do not 
believe this hypothesis has been tested, but any device effect 
of this nature would be expected to be in the same direction 
as the nonresponse effect. A dual user should have a greater 
likelihood of reporting as cell-mainly if sampled from the 
cell frame; they should be more likely to report as land- 
mainly if sampled from the landline. Thus, the bias dis- 
cussed earlier in the context of nonresponse could be arising 
due to the combined effect of nonresponse and device 
effect. Without being able to identify the magnitude of these 
sources of the bias, methods for reducing bias are unclear. 


4. Design and estimation approaches 
with nonsampling errors 


Because of the additional issues at play in dual frame 
surveys, sampling and estimation methods should be de- 
signed to account for the most important sources of error 
rather than focusing solely on sampling error. In this section 
we address sample design and estimation choices for dual 
frame telephone surveys within this larger error structure 
setting. 


4.1 Sample design approaches 


A key design decision for a dual frame telephone survey 
is whether to use a screening or full overlap sample design. 
We begin by exploring the optimal allocation of the sample 
for overlap and screening designs appropriate for dual frame 
telephone surveys when simple random samples are selected 
independently from the two frames and $N_a > 0$, $N_b > 0$, 
and $N_{ab} > 0$. We assume throughout that the sample sizes 
are large enough to ignore the finite population correction 
factors. 

We use a linear expected cost function 
$E(C) = c_a(n_a + n_b c_b c_a^{-1})$, where $c_a$ is the cost of a 
landline interview, $c_b$ is the cost of a cell phone interview, 
and $n_a$ and $n_b$ are the number sampled from frames A and B, 
respectively. Assuming a constant element variance, $\sigma^2$, the 
variance of the overlap estimator is 
$v_{ov} = \sigma^2 (N_A(N_a + \lambda^2 N_{ab}) n_a^{-1} + N_B(N_b + (1-\lambda)^2 N_{ab}) n_b^{-1})$. 
The allocation that minimizes the variance with this cost 
function can be found by standard Lagrangian methods, and is 

$$n_{a,o} = E(C)\, t^{-1} \sqrt{c_a^{-1} N_A (N_a + \lambda^2 N_{ab})},$$

$$n_{b,o} = E(C)\, t^{-1} \sqrt{c_b^{-1} N_B (N_b + (1-\lambda)^2 N_{ab})}, \qquad (6)$$

where 

$$t = \sqrt{c_a N_A (N_a + \lambda^2 N_{ab})} + \sqrt{c_b N_B (N_b + (1-\lambda)^2 N_{ab})}.$$

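
The Lagrangian step can be sketched as follows. The shorthand $Q_a = N_A(N_a + \lambda^2 N_{ab})$ and $Q_b = N_B(N_b + (1-\lambda)^2 N_{ab})$ and the multiplier $\mu$ are introduced here only for illustration and are not part of the paper's notation; $N_A$ and $N_B$ are taken to be the sizes of frames A and B.

```latex
\begin{align*}
\mathcal{L} &= \sigma^2\left(\frac{Q_a}{n_a} + \frac{Q_b}{n_b}\right)
              + \mu\,\bigl(c_a n_a + c_b n_b - E(C)\bigr),\\
\frac{\partial \mathcal{L}}{\partial n_a} = 0
  &\;\Rightarrow\; n_a = \frac{\sigma}{\sqrt{\mu}}\sqrt{\frac{Q_a}{c_a}},
\qquad
\frac{\partial \mathcal{L}}{\partial n_b} = 0
  \;\Rightarrow\; n_b = \frac{\sigma}{\sqrt{\mu}}\sqrt{\frac{Q_b}{c_b}},\\
c_a n_a + c_b n_b = E(C)
  &\;\Rightarrow\; \frac{\sigma}{\sqrt{\mu}}
   = \frac{E(C)}{\sqrt{c_a Q_a} + \sqrt{c_b Q_b}} = \frac{E(C)}{t}.
\end{align*}
```

Substituting the last line into the expressions for $n_a$ and $n_b$ gives the allocation above, and the minimized variance is then $\sigma^2 t^2 / E(C)$.
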
For a screening design, a linear cost function appropriate 
for dual frame telephone surveys is $E(C) = c_a n_a + n_b' c_b'$, 
where $c_b' = c_b + N_B N_b^{-1} c_s$, $n_b'$ is the sampled number of 
cell-only, and $c_s$ is the cost of screening. The variance of 
the screening estimator is 
$v_{scr} = \sigma^2 (N_A^2 n_a^{-1} + N_b N_B n_b^{-1})$. 
The optimal allocation is just the stratified allocation given 
by $n_{a,s} = E(C) N_A (c_a N_A + \sqrt{c_a c_b'}\, N_b)^{-1}$ and 

$$n_{b,s} = \frac{E(C) N_B}{\sqrt{c_a c_b'}\, N_A + c_b' N_b},$$

yielding 

$$n_b' = \frac{E(C) N_b}{\sqrt{c_a c_b'}\, N_A + c_b' N_b}$$

cell-only interviews. 

With no nonsampling error and a fixed expected cost, the 
variance for the optimally allocated overlap design is smaller 
than the variance for the optimally allocated screener 
design when the cost of screening is large enough so that 
$\sqrt{c_b'} > N_b^{-1}(t - N_A \sqrt{c_a})$. When bias is included, the 
screening design may have smaller mean square error than 
the overlap design even when this condition holds. In the 
analysis below, we consider bias but do not account for all 
the effects of nonsampling error. For example, differential 
response affects the yield by the sampling frame from which 
the units are selected thus affecting the allocation and 
variance of the estimate. 

We compare the mean square errors of the screening and 
overlap designs under the CHIS 2007 parameters given 
previously. The mean square error is the sum of the variance 
and the bias squared. The variance is for the overall 
estimate, but the bias arises only from the overlap under our 
assumptions. The cost parameters for interviewing and 
screening cell phones are still not very well-known, but we 
use ($c_a = 1$, $c_b = 3$, $c_s = 2$) based on information given by 
Keeter et al. (2008) and Edwards, Brick and Grant (2008). 
The other parameters needed for the comparison are the 
distribution of the population by telephone status domain, 
and we approximate national values from the 2008 NHIS 
national estimates ($N_a = 0.2N$, $N_b = 0.2N$, and $N_{ab} = 
0.6N$). In this situation, the variance based on an optimally 
allocated overlap design with $\lambda = 0.5$ is slightly smaller 
than the variance for the optimal screening design (the ratio 
of the variances is 0.976). The variances of the two designs 
are approximately the same when the cost parameters are 
such that the screening from frame B is slightly less 
expensive ($c_a = 1$, $c_b = 3$, $c_s = 1.85$). 
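
The variance comparison in this paragraph can be checked numerically. The sketch below is only an illustration: the function names are hypothetical, the frame sizes are assumed to be $N_A = N_a + N_{ab}$ and $N_B = N_b + N_{ab}$, and it relies on the fact that each optimally allocated design has minimized variance proportional to the square of its $t$-type quantity divided by $E(C)$, so that $\sigma^2$ and $E(C)$ cancel in the ratio.

```python
from math import sqrt

def overlap_t(c_a, c_b, N_a, N_b, N_ab, lam):
    """Quantity t for the optimally allocated overlap design.

    Assumes frame sizes N_A = N_a + N_ab and N_B = N_b + N_ab.
    """
    N_A, N_B = N_a + N_ab, N_b + N_ab
    return (sqrt(c_a * N_A * (N_a + lam ** 2 * N_ab))
            + sqrt(c_b * N_B * (N_b + (1 - lam) ** 2 * N_ab)))

def screening_t(c_a, c_b, c_s, N_a, N_b, N_ab):
    """Analogous quantity for the optimally allocated screening design."""
    N_A, N_B = N_a + N_ab, N_b + N_ab
    # Cost of one completed cell-only interview, including screening.
    c_b_prime = c_b + (N_B / N_b) * c_s
    return N_A * sqrt(c_a) + N_b * sqrt(c_b_prime)

def variance_ratio(c_a, c_b, c_s, N_a, N_b, N_ab, lam):
    """Overlap variance divided by screening variance at equal expected cost."""
    return (overlap_t(c_a, c_b, N_a, N_b, N_ab, lam)
            / screening_t(c_a, c_b, c_s, N_a, N_b, N_ab)) ** 2

# Domain sizes expressed as proportions of N (0.2, 0.2, 0.6), lambda = 0.5.
ratio_cs_2 = variance_ratio(1, 3, 2, 0.2, 0.2, 0.6, 0.5)
ratio_cs_185 = variance_ratio(1, 3, 1.85, 0.2, 0.2, 0.6, 0.5)
print(round(ratio_cs_2, 3), round(ratio_cs_185, 3))
```

Under these assumptions the first ratio comes out near 0.976 and the second near 1, in line with the figures quoted in the text.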

The screening approach has smaller mean square error 
than the overlap design under these conditions because the 
screening approach reduces the bias of the estimates from 
-3.3 percentage points to 1.3 points. Even a relatively small 
bias dominates the mean square error comparison between 
the two designs, assuming the bias with the screening approach 
is half the bias under the overlap design. This is the 
case because the variances of the overlap and screening 
designs are so similar. If we instead use the parameters from 
the Pew surveys, then the mean square error for the overlap 
design is smaller because its bias is lower than the bias of 
the screener design. 
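
To see why a small bias can dominate when the variances are this close, the following sketch computes the overlap bias at which the two mean square errors coincide. It is an illustration only: the function names are hypothetical, variances are expressed relative to the screening variance (set to 1), and the screening bias is assumed to be a fixed fraction (one half, as in the text) of the overlap bias.

```python
from math import sqrt

def mse(variance, bias):
    """Mean square error: variance plus squared bias."""
    return variance + bias ** 2

def break_even_bias(var_overlap, var_screen, bias_fraction=0.5):
    """Overlap bias at which both designs have the same MSE.

    Solves var_overlap + b^2 = var_screen + (bias_fraction * b)^2 for b >= 0.
    Returns 0.0 if the overlap design has no variance advantage to offset.
    """
    gap = var_screen - var_overlap
    if gap <= 0:
        return 0.0
    return sqrt(gap / (1 - bias_fraction ** 2))

# Variance ratio of 0.976 from the comparison above; screening variance = 1.
b_star = break_even_bias(0.976, 1.0)
print(round(b_star, 3))
```

With these inputs the break-even bias is roughly 0.18 screening standard errors; for any larger overlap bias the screening design has the smaller mean square error under the stated assumption.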

The allocation to the frames with the overlap approach 
given by (6) assuming only sampling error is determined by 
the population parameters, the cost parameters, and the 
compositing factor. While this is not the optimal allocation 
when differential response rates are admitted, it is still useful 
to consider this situation since it is likely to be encountered 
frequently in practice. In this situation, the bias of the overlap 
estimator due to differential nonresponse can be eliminated by 
choosing $\lambda$ to satisfy (4). Based on the CHIS parameters, the value 
that eliminates this bias is $\lambda = 0.84$. If we continue with the 
cost and population assumptions as above, but set $\lambda = 0.84$, 
then the optimal allocation given by (6) would select about 
75% of the sample from the landline frame. This contrasts 
with the allocation with $\lambda = 0.5$, in which only 63% is from 
the landline frame. The choice of the compositing factor is 
critical. When $\lambda = 0.84$ is used in conjunction with the 
optimal allocation for the CHIS parameters, the estimator is 
unbiased and has a variance that is about 5 percent less than 
the estimator from the optimal screener design. 
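
The allocation shares quoted above can be reproduced with a few lines of code. This is a sketch under stated assumptions, not the authors' implementation: the function names and the expected cost of 1000 are hypothetical, domain sizes are given as proportions of $N$, and the frame sizes are taken to be $N_A = N_a + N_{ab}$ and $N_B = N_b + N_{ab}$.

```python
from math import sqrt

def overlap_allocation(expected_cost, c_a, c_b, N_a, N_b, N_ab, lam):
    """Variance-minimizing sample sizes (n_a, n_b) for the overlap design."""
    N_A, N_B = N_a + N_ab, N_b + N_ab
    q_a = N_A * (N_a + lam ** 2 * N_ab)
    q_b = N_B * (N_b + (1 - lam) ** 2 * N_ab)
    t = sqrt(c_a * q_a) + sqrt(c_b * q_b)
    n_a = expected_cost / t * sqrt(q_a / c_a)
    n_b = expected_cost / t * sqrt(q_b / c_b)
    return n_a, n_b

def landline_share(lam):
    """Fraction of the total sample drawn from the landline frame (frame A)."""
    n_a, n_b = overlap_allocation(1000.0, 1, 3, 0.2, 0.2, 0.6, lam)
    return n_a / (n_a + n_b)

print(round(landline_share(0.5), 2), round(landline_share(0.84), 2))
```

The share depends only on the cost and population parameters and on $\lambda$, not on the total budget, which is why an arbitrary expected cost can be used here.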


4.2 Estimation approaches 


An approach suggested by Brick et al. (2006) is to use a 
full overlap design with an average estimator for the overlap 
that is poststratified to telephone usage domain totals, as is 
done in the Pew surveys. This estimator is unbiased and 
consistent if the estimates within the domains are unbiased 
and the domain sample sizes are sufficiently large. 

The auxiliary data needed for this poststratification for 
the entire U.S. are now published regularly from the NHIS. 
As mentioned above, there are some concerns about using 
these data as control totals that deserve further study. The 
control totals needed for this estimator are the number of 
land-only adults, the number of cell-only adults, and the 
number of adults who are land-mainly and the number who 
are cell-mainly ($N_{ml}$ and $N_{mc}$, respectively). This partitions 
the dual users into its two components. 

An alternative estimator of the overlap total using the 
same auxiliary data is 


$$\hat{y}_{sep,ab} = \lambda_1\, g^A_{ml}\, \hat{y}^A_{ab}(ml) + (1-\lambda_1)\, g^B_{ml}\, \hat{y}^B_{ab}(ml)$$

$$\qquad\qquad + \lambda_2\, g^A_{mc}\, \hat{y}^A_{ab}(mc) + (1-\lambda_2)\, g^B_{mc}\, \hat{y}^B_{ab}(mc), \qquad (7)$$

where the poststratification factors are 
$g^A_{ml} = N_{ml}/\hat{N}^A_{ml}$, $g^A_{mc} = N_{mc}/\hat{N}^A_{mc}$, 
$g^B_{ml} = N_{ml}/\hat{N}^B_{ml}$, $g^B_{mc} = N_{mc}/\hat{N}^B_{mc}$, 
and $0 \le \lambda_1 \le 1$, $0 \le \lambda_2 \le 1$. This estimator, like the others 
considered thus far, is unbiased and consistent in the 
absence of nonsampling errors. Like (1), the estimates from 
each frame are poststratified before being averaged. The 
primary difference between (1) and (7) is that the dual users 
in (7) are partitioned and poststratified by usage; it also 
introduces different compositing factors within the overlap. 

The estimator in (7) may be useful when (1) is biased and 
usage control totals are available for poststratification. If the 
expected means within the usage domains are approximately 
equal ($E\bar{y}^A_{ab}(ml) = E\bar{y}^B_{ab}(ml) = \bar{Y}_{ml}$ and 
$E\bar{y}^A_{ab}(mc) = E\bar{y}^B_{ab}(mc) = \bar{Y}_{mc}$), then (7) is unbiased for any choice of 
$0 \le \lambda_1 \le 1$ and $0 \le \lambda_2 \le 1$. Since bias is not affected by 
the choice, different compositing factors may be used to 
reduce the variance of the estimates as is traditionally 
suggested in the dual frame literature. Table 3 shows that 
the proportion of respondents in the detailed usage domains 
varies considerably by the sampling frame, and this might 
make different compositing factors worthwhile. 
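
A minimal sketch of an overlap estimator of this kind follows: each frame's estimate is poststratified to the usage control total within the land-mainly ("ml") and cell-mainly ("mc") domains, and the two frames are then composited with a domain-specific factor. The function name, argument layout, and numbers are all hypothetical and chosen only to illustrate the invariance to the compositing factors when the two frames have equal domain means.

```python
def usage_poststratified_overlap(est_a, est_b, count_a, count_b, controls, lam):
    """Composite overlap total, poststratified by usage domain.

    est_a[d], est_b[d]     : estimated total of y in domain d from frames A and B
    count_a[d], count_b[d] : estimated number of adults in domain d from each frame
    controls[d]            : control total for domain d
    lam[d]                 : compositing factor applied to frame A in domain d
    """
    total = 0.0
    for d in ("ml", "mc"):
        g_a = controls[d] / count_a[d]  # poststratification factor, frame A
        g_b = controls[d] / count_b[d]  # poststratification factor, frame B
        total += lam[d] * g_a * est_a[d] + (1 - lam[d]) * g_b * est_b[d]
    return total

# Hypothetical inputs in which both frames give the same domain means
# (0.5 for land-mainly, 0.3 for cell-mainly) but different domain counts.
controls = {"ml": 100.0, "mc": 200.0}
count_a, est_a = {"ml": 80.0, "mc": 100.0}, {"ml": 40.0, "mc": 30.0}
count_b, est_b = {"ml": 50.0, "mc": 250.0}, {"ml": 25.0, "mc": 75.0}

equal_weights = usage_poststratified_overlap(
    est_a, est_b, count_a, count_b, controls, {"ml": 0.5, "mc": 0.5})
unequal_weights = usage_poststratified_overlap(
    est_a, est_b, count_a, count_b, controls, {"ml": 0.2, "mc": 0.9})
print(equal_weights, unequal_weights)
```

Because the domain means agree across frames in this toy input, both calls return the same total, which mirrors the statement that the compositing factors can then be chosen for variance reduction alone.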

Because telephone usage control totals often are not 
available, we explored modifying (2) to use different 
compositing factors similar to those used in the overlap for 
(7). In this case, the goal would be to reduce bias rather than 
variance. A modified estimator of the overlap total is 


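$$\hat{y}_{mod,ab} = \lambda_1\, \hat{y}^A_{ab}(ml) + (1-\lambda_1)\, \hat{y}^B_{ab}(ml)$$

$$\qquad\qquad + \lambda_2\, \hat{y}^A_{ab}(mc) + (1-\lambda_2)\, \hat{y}^B_{ab}(mc). \qquad (8)$$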


However, this estimator may not be useful for reducing bias. 
Earlier, we showed that the bias of the unmodified overlap 
estimator vanishes when the compositing factor satisfies (4). 
Setting $\lambda_1$ and $\lambda_2$ in (8) both equal to this value 
eliminates the bias for both land-mainly and cell-mainly 
estimates, so that different compositing factors are 
not useful for bias reduction. The bias of the modified 
estimator is 


B(Proaad) = WN ap You Oatinty | + = Myra =D 


Naa ee nh + (Ue ha ae), 


where we make assumptions similar to those used earlier to 
approximate the bias of the unmodified overlap estimator. 

Another reason for studying an overlap estimator like (8) 
is because it is appropriate with sample designs that screen 
out land-mainly adults from the cell frame. This approach 
has been considered because the number of cell frame 
respondents that are classified as land-mainly may be small, 
and the assumption that $E\bar{y}^B_{ab}(ml) = \bar{Y}_{ml}$ may not hold and 
biases might result. 

Setting $\lambda_1 = 1$, (8) reduces to 

$$\hat{y}_{mod\,\lambda_1=1,ab} = \hat{y}^A_{ab}(ml) + \lambda_2\, \hat{y}^A_{ab}(mc) + (1-\lambda_2)\, \hat{y}^B_{ab}(mc). \qquad (10)$$

In this design, the landline sample alone is used to estimate 
both the land-only and the land-mainly totals. Both frames 
are used to estimate totals for the cell-mainly. If we assume 
$E\bar{y}^A_{ab}(ml) = \bar{Y}_{ml}$ and $E\bar{y}^A_{ab}(mc) = E\bar{y}^B_{ab}(mc) = \bar{Y}_{mc}$, then 
we no longer need $E\bar{y}^B_{ab}(ml) = \bar{Y}_{ml}$ for (10) to be unbiased. 
As before, choosing $\lambda_2$ to satisfy (4) eliminates the bias 
in the cell-mainly estimate. 


5. Discussion 


This exploration of nonresponse and measurement errors 
in dual frame telephone surveys suggests the effects of these 
errors may be very important. It leads us to believe that 
research on nonsampling errors to reduce biases may be 
more important than research that leads to incremental re- 
ductions in sampling error. 

The research also reveals shortcomings in our knowledge 
about nonsampling errors in these surveys. The direction 
and magnitude of the effects of measurement error are 
especially unclear. The inconsistencies in some of the 
findings for the CHIS 2007 and Pew surveys may well be 
due to measurement errors associated with the different 
approaches to data collection in these surveys, or to inter- 
actions due to the procedures. A thorough investigation of 
the error sources in dual frame telephone surveys is essential 
to improve the quality of dual frame telephone surveys, and 
we believe experiments to assess the effects of measurement 
error would be especially beneficial. 

We did find that the CHIS 2007 and Pew surveys consis- 
tently over-represented cell-only and cell-mainly users in 
samples from the cell phone frame, and the surveys had a 
slight over-representation of the land-only and land-mainly 
from the landline frame. However, the degree of over-repre- 
sentation of the domains differed by survey. In the CHIS, 
the over-representation could have led to substantial biases 
in the estimates if an overlap survey and a simple average 
estimator were used. The CHIS used a screening approach 
to reduce this potential bias, and this appears to have been 
largely successful. In the Pew surveys, the representation 
was less differential by frame and the potential for bias was 
smaller. In these conditions, the overlap approach may have 
smaller mean square error than a screening approach. 

Due to the potential for bias in dual frame telephone 
surveys with response patterns like the CHIS 2007, we 
examined sampling and estimation methods that could be 
implemented to deal with these biases. We found that 
screening approaches may be competitive or even pref- 
erable in dual frame telephone surveys when the bias due 
to differential nonresponse or measurement error is large. If 
the bias is not negligible, this finding even holds with small 
sample sizes. However, these results depend on the choice 
of the compositing factor and the current practice of choosing 
$\lambda = 0.5$ should be reconsidered. An alternative is to 
choose the compositing factor to eliminate the bias of the 
average estimator. In many cases, this approach not only 
eliminates the bias, but also may be more efficient. 

We examined three estimators that deal with the bias due 
to differential nonresponse within the overlap domain. The 
first is the estimator in (1), which uses telephone status as domain 
control totals. This estimator eliminates the bias due to differential 
nonresponse when the compositing factor is chosen to satisfy (4). 
This compositing factor indirectly uses information on the 
land-mainly and cell-mainly domain totals in computing 
response rates by domain and frame. A second estimator, 
the estimator in (7), eliminates this source of bias more directly by 
poststratifying to telephone status and usage control totals. This 
estimator also permits the use of different compositing 
factors within the overlap domain to reduce the variance of 
the estimates. The third estimator that might be used to 
reduce bias is $\hat{y}_{mod}$, but this estimator is more pertinent for 
a sample design that interviews the cell-only and the cell- 
mainly respondents from the cell frame, along with all 
respondents from the landline sample. This modified 
screening design and estimator might be especially attractive 
if there is concern that the mean of the land-mainly 
respondents from the cell frame sample is subject to 
nonresponse bias. All of these estimators could also be 
raked to additional demographic control totals after com- 
bining the two samples. 

Given our current state of knowledge, we believe there 
are important advantages with the full overlap design and 
the estimator in (1) with $\lambda$ chosen based on other similar surveys. 
It is worth observing that even though the CHIS and Pew 
surveys had very different response patterns, choosing a 
value of $\lambda = 0.75$ would have reduced the bias substantially 
for both surveys. An advantage of this estimator over 
the estimator in (7) in general is that it is not poststratified to usage 
domain totals. We suspect that usage domain totals esti- 
mated from a face-to-face survey (NHIS) may be subject to 
substantially different errors than the estimates from tele- 
phone surveys. These differences could result in telephone 
survey estimates that are biased and have underestimated 
variances. For state and local surveys where even telephone 
status totals are not well-known, control totals for usage 
domains are likely to be highly suspect. 

A screening design with the screening estimator has the 
advantage that it only requires control totals for the entire 
population and for the cell-only component, such as those 
estimated from the NHIS. A disadvantage is that, unlike the 


overlap estimators, there is no compositing parameter that 
can be used to reduce the bias directly. The more elaborate 
screening design that interviews cell-only and cell-mainly 
from the cell frame and uses $\hat{y}_{mod}$ has merit, but there have 
been no studies that examine the conditions which would 
favor this estimator. 

A more complete analysis of the effects of nonsampling 
error would include other factors such as the effect of the 
differential response rates by frame. For example, we noted 
that samples from the cell phone frame yield more cell-only 
households than would be expected. These differential 
response rates can be addressed in allocating the sample, but 
we have not done so here. Our exploration of this shows that 
it results in larger allocations to the landline frame, increases 
the value of the compositing factor, and makes the screening 
designs more efficient relative to the overlap designs. The 
screening design and estimator are still subject to the bias 
noted above. 

While this research concentrated on nonsampling errors 
in dual frame telephone surveys, we suspect that similar 
issues exist in many other dual frame surveys, but that these 
issues may not be recognized. Lohr (2009) mentions non- 
sampling errors in general dual frame surveys and suggests 
comparing estimates of the overlap from each frame as a 
simple diagnostic test. We believe this is an excellent way to 
begin an investigation of problems associated with the 
overlap. 

As we noted earlier, the handling of the overlap is a 
major concern in dual frame surveys because nonsampling 
error may be associated with the sampling frame. Our inves- 
tigation shows that nonresponse and measurement errors are 
tied to the sampling frame in dual frame telephone surveys. 
It is very likely that dual frame telephone surveys that use 
different modes might experience analogous effects. For 
example, consider a dual frame household survey designed 
to survey members of a rare population. Suppose it uses an 
incomplete membership list with telephone numbers for the 
rare group as frame A, and an area probability sample of 
households as frame B. Different response rates by sampling 
frame within the overlap might be expected, and these might 
be related to characteristics of the respondents leading to 
biases. Even within the overlap, there may be differences 
such as those related to how long the person has been a 
member of the organization used to create frame A, and this 
might be related to characteristics such as age. This type of 
situation might parallel some of the within overlap domain 
issues identified in telephone surveys. Differential measure- 
ment errors related to the modes are also possible. 

Given the potential for bias in a dual frame survey, one 
of the important findings of our research is that the compositing 
factor, $\lambda$, influences the bias as well as having an 
effect on the variance. While the choice of $\lambda$ typically has 
only a slight effect on the variance if $\lambda$ is in the vicinity of 
the optimal value, the bias may be more sensitive to this 
choice. Thus, in dual frame surveys understanding how the 
choice of $\lambda$ affects the bias and the mean square error of 
the estimates is an important consideration. The other 
sampling and estimation methods discussed in this paper 
may also be applicable to other dual frame surveys. The 
usefulness of these methods depends upon understanding 
the nature of the nonsampling errors as well as the avail- 
ability of auxiliary data that could be used in calibration. 
Appendix 
Telephone usage items 
National Health Interview Survey 


N1. Is there at least one telephone inside your home that 
is currently working and is not a cellular phone? 


N2. Does anyone in your family have a working cellular 
telephone? 


N3. How many working cellular telephones do people in 
your family have? 


[If both N1 and N2 are ‘yes’ ask N4] 


N4. Of all the telephone calls that your family receives, 
are ... 


All or almost all calls received on cell phones? 


Some received on cell phones and some on regular 
phones? 


Very few or none received on cell phones? 


California Health Interview Survey — Cell phone 
CC1. Is this cell phone your only phone or do you also have 
a regular telephone at home? 


[If the phone is a cell phone and they have a regular 
phone then ask CC2] 


CC2. Of all the telephone calls that you receive, are ... 
All or almost all calls received on cell phones 


Some received on cell phones and some on regular 
phones, or 
Very few or none on cell phones? 


[If respondent replies about half, record it] 


California Health Interview Survey — Landline 


CL1. Do you have a working cell phone? 
[If yes or they share a cell phone ask CL2] 

CL2. Of all the telephone calls that you receive, are ... 
All or almost all calls received on cell phones 


Some received on cell phones and some on regular 
phones, or 


Very few or none on cell phones? 


[If respondent replies about half, record it] 


Pew Research Center for the People & The Press — 
Cell phone 


PC1. Now thinking about your telephone use... Is there at 
least one telephone INSIDE your home that is 
currently working and is not a cell phone? 


[If yes ask PC2] 

PC2. Of all the telephone calls that you receive, do you get? 
[Rotate options—keeping SOME in the middle] 
All or almost all calls on a cell phone 


Some on a cell phone and some on a regular home 
phone 


All or almost all calls on a regular home phone 


Pew Research Center for the People & The Press — 
Landline 


PL1. Now thinking about your telephone use... Do you 
have a working cell phone? 
[If yes ask PL2] 

PL2. Of all the telephone calls that you receive, do you get? 
[Rotate options—keeping SOME in the middle] 
All or almost all calls on a cell phone 


Some on a cell phone and some on a regular home 
phone 


All or almost all calls on a regular home phone 
Maximum likelihood estimation for contingency 
tables and logistic regression with incorrectly linked data 


James O. Chipperfield, Glenys R. Bishop and Paul Campbell ' 


Abstract 


Data linkage is the act of bringing together records that are believed to belong to the same unit (e.g., person or business) 
from two or more files. It is a very common way to enhance dimensions such as time and breadth or depth of detail. Data 
linkage is often not an error-free process and can lead to linking a pair of records that do not belong to the same unit. There 
is an explosion of record linkage applications, yet there has been little work on assuring the quality of analyses using such 
linked files. Naively treating such a linked file as if it were linked without errors will, in general, lead to biased estimates. 
This paper develops a maximum likelihood estimator for contingency tables and logistic regression with incorrectly linked 
records. The estimation technique is simple and is implemented using the well-known EM algorithm. A well known method 
of linking records in the present context is probabilistic data linking. The paper demonstrates the effectiveness of the 
proposed estimators in an empirical study which uses probabilistic data linkage. 


Key Words: Data linkage; Probabilistic linkage; Maximum likelihood; Contingency tables; Logistic regression. 


1. Introduction 


Data linking, also referred to as data linkage or record 
linkage, is the act of bringing together records that are 
believed to belong to the same unit (e.g., a person or busi- 
ness), from two or more files. Data linkage is an appropriate 
technique when data sets must be joined to enhance dimen- 
sions such as time and breadth or depth of detail. Ideally, the 
linkage will be perfect, meaning only records belonging to 
the same unit are linked and all such links are made. How- 
ever, in many situations this does not happen, especially 
when linking records using fields that may have incorrect 
values, missing values or values that are legitimately dif- 
ferent for a given unit. 

Probabilistic linking is often used when the files contain 
a set of common variables or fields that constitute partial 
identifying information, but which do not constitute a 
unique unit identifier. In probabilistic linking (Fellegi and 
Sunter 1969) all possible links are given a score based on 
the probability that the records belong to the same unit. This 
score is calculated by comparing the values of linking 
variables that are common to both files. A link is then 
declared if the link score is higher than some cut-off. An 
optimisation algorithm may be used to ensure that each 
record on one file is linked to no more than one record on 
the other file. Probabilistic methods for linking files are now 
well established (see Herzog, Scheuren and Winkler 2007, 
Winkler 2001 and Winkler 2005) and there is a range of 
computer packages available to implement them. 
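
As an illustration of how such a link score can be built, the sketch below sums log-likelihood-ratio weights over the linking fields in the style of Fellegi and Sunter (1969). It is a simplified sketch, not the procedure used in this paper: the field names, the m-probabilities (chance of agreement for a true match), the u-probabilities (chance of agreement for a non-match), and the cut-off are all hypothetical values.

```python
from math import log2

def link_score(agreements, m, u):
    """Sum of per-field agreement and disagreement weights for one record pair.

    agreements[f] : True if the two records agree on field f
    m[f]          : P(agree on f | records belong to the same unit)
    u[f]          : P(agree on f | records belong to different units)
    """
    score = 0.0
    for field, agrees in agreements.items():
        if agrees:
            score += log2(m[field] / u[field])
        else:
            score += log2((1 - m[field]) / (1 - u[field]))
    return score

# Hypothetical parameters for two linking fields.
m = {"surname": 0.8, "birth_year": 0.8}
u = {"surname": 0.2, "birth_year": 0.2}
cutoff = 1.0  # hypothetical threshold for declaring a link

both_agree = link_score({"surname": True, "birth_year": True}, m, u)
one_agrees = link_score({"surname": True, "birth_year": False}, m, u)
print(both_agree > cutoff, one_agrees > cutoff)
```

In this toy setting a pair agreeing on both fields exceeds the cut-off and is declared a link, while a pair agreeing on only one field does not.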

This is a consequence of the continued importance of 
linkage in a variety of fields, particularly relating to health 
and social policy. Recent examples of probabilistic data 


linkage from the Australian Bureau of Statistics (ABS) 
include linking records from the 2006 Australian Census of 
Population and Housing to a number of data sets including 
Australian death registrations (Australian Bureau of Statis- 
tics 2008), the 2006 Census Dress Rehearsal (Solon and 
Bishop 2009), and the Australian Migrants Settlements 
Database (Wright, Bishop and Ayre 2009). In the health 
arena within Australia, probabilistic linkage methods are 
used by the Western Australian Data Linkage Unit (Holman, 
Bass, Rouse and Hobbs 1999) and by the New South Wales 
Centre for Health Record Linkage. Internationally, prob- 
abilistic methods are used by Statistics Canada (Fair 2004), 
USBC (see Winkler 2001), the U.S. National Center for 
Health Statistics (National Center for Health Statistics 2009) 
and by the Switzerland Statistical agency as part of their 
Longitudinal Study of People Living in Switzerland. 

Data linking offers opportunities for new statistical 
output and analysis. Naively treating a probabilistically- 
linked file as if it was perfectly linked will, in general, lead 
to biased estimates. Lahiri and Larsen (2005) and Scheuren 
and Winkler (1993) proposed methods to calculate unbiased 
estimates of coefficients for a linear regression model under 
probabilistic record linkage. More recently, Chambers, 
Chipperfield, Davis and Kovačević (2009) and Chambers 
(2008) extended this work to a wide set of models using 
generalised estimating equations and, in the case of linking 
two files, allowing one file to be a subset of the other file. 

This paper develops a maximum likelihood (ML) ap- 
proach for analysis of probabilistically-linked records. The 
estimation technique is simple and is implemented using the 
well-known EM algorithm. The approach involves replacing 
the statistics, which would be observed from perfectly linked 
data, with their expectation conditional on the linked data. 
Assuming this expectation is correctly specified, this ap- 
proach overcomes the following two limitations of the 
previous work. 

First, the previous methods assume only one linkage pass 
is made, whereas, probabilistic linkage usually involves 
multiple passes. In the latter case, records not linked in the 
first pass are eligible to be linked in the second pass, and 
only records not linked in the first two passes are eligible to 
be linked in the third pass, and so on. Each pass is designed 
to link records with a particular common set of charac- 
teristics. For example, the first pass may be designed to link 
records belonging to individuals who have not changed 
address between the reference dates of the two files. The 
second pass may be designed to accommodate changes of 
address. An example of such an approach is given in Table 
1 in section 5. 
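
The pass structure described above can be sketched as follows. This is a deliberately simplified illustration that uses exact agreement on a set of fields in each pass rather than probabilistic scoring; the function name, field names, and records are hypothetical.

```python
def multi_pass_link(file_x, file_y, passes):
    """Link records over several passes; later passes see only unlinked records.

    file_x, file_y : dict mapping record id -> dict of field values
    passes         : list of tuples of field names, one tuple per pass
    Returns a dict mapping x id -> (y id, pass number).
    """
    links = {}
    unlinked_x = dict(file_x)
    unlinked_y = dict(file_y)
    for pass_no, fields in enumerate(passes, start=1):
        # Index the still-unlinked File Y records by this pass's key.
        index = {}
        for y_id, rec in unlinked_y.items():
            index.setdefault(tuple(rec[f] for f in fields), []).append(y_id)
        for x_id, rec in list(unlinked_x.items()):
            candidates = index.get(tuple(rec[f] for f in fields), [])
            if len(candidates) == 1:  # link only unambiguous candidates
                y_id = candidates.pop()  # emptying the list keeps links one-to-one
                links[x_id] = (y_id, pass_no)
                del unlinked_x[x_id]
                del unlinked_y[y_id]
    return links

file_x = {1: {"name": "ann", "address": "a1", "birth_year": 1980},
          2: {"name": "bob", "address": "b1", "birth_year": 1975}}
file_y = {10: {"name": "ann", "address": "a1", "birth_year": 1980},
          20: {"name": "bob", "address": "b2", "birth_year": 1975}}

# Pass 1 targets people who have not moved; pass 2 allows a change of address.
links = multi_pass_link(file_x, file_y,
                        [("name", "address"), ("name", "birth_year")])
print(links)
```

Here the first record is linked in pass 1, and the second, whose address differs between the files, is picked up only in pass 2.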

Second, the previous methods assume that either the two 
files contain records from exactly the same units or the set 
of units on one file is a subset of those on the other file. The 
approach proposed can be used when one of the files to be 
linked is not necessarily a subset of the other file. This 
situation occurs frequently in practice and occurred in all the 
ABS examples mentioned above. It is also worth men- 
tioning that the files to be linked do not need to be related 
via a sampling mechanism, such as the smaller file being a 
random sub-sample of individuals from the larger file. 
Removing this restriction means that the two files may be 
administrative data sets. 

Consider linking two files denoted by X and Y. File Y 
contains the variable y on the population of individuals $U_y$, 
comprising $n_y$ records. File X contains a vector of variables, 
x, on the population of individuals $U_x$, comprising $n_x$ 
records. The target of inference is with respect to the 
population of $n_{xy}$ individuals, denoted by $U_{xy} = U_x \cap U_y$, 
who are common to File X and File Y. Files X and Y also 
contain a vector of fields, denoted by z, which are used to 
link the files using a probabilistic linkage algorithm. Of 
course, since we are considering probabilistic linkage here, 
the variable z does not constitute a unique unit identifier. 

Linking Files X and Y allows the joint distribution of x 
and y to be analysed. There are two sources of error that 
may affect analysis of the joint distribution using the linked 
file. These errors are referred to as incorrect links and 
unlinked records. 

A link is correct when the pair of linked records belong 
to the same individual. A link is incorrect when a pair of 
linked records do not belong to the same individual. Incor- 
rect links can artificially increase or decrease the correlation 
between x and y. An example of the latter is random 
linkage, where records on File X are randomly linked to 
records on File Y. 

The $i$th record on File X is defined as an unlinked 
record if $i \in U_{xy}$ and record $i$ was not linked to a record 
on File Y. Or in other words, an unlinked record is a record 
on File X that could be correctly linked but was not linked at 
all (throughout this paper we use the convention of defining 
unlinked records in terms of File X, though the definition 
could equally be in terms of records on File Y). It may not 
always be possible to link a particular record on File X with 
much confidence that the link is correct. This situation may 
arise if a record is missing fields that are useful in estab- 
lishing the correct link. More generally, unlinked records 
may occur when some sub-populations are relatively 
difficult to link. For example, fields such as marital status, 
qualification, field of study, and highest level of schooling 
would generally not be as powerful when linking children as 
when linking mature adults. In this situation, the data linker 
must decide whether or not to link such records. We define 
the set of linked records by $U^*$, of size $n^*$, so that $n^* \le n_x$ 
and $n^* \le n_y$. 

The problem of analysis with unlinked records has clear 
parallels with the problem of unit non-response. Both lead 
to only a subset of legitimate records being available for 
analysis. The non-response mechanism in survey sampling 
is, in reality, a function of an unknown set of variables. Here 
however, we have the slight advantage in knowing that the 
probability of a record remaining unlinked can only be a 
function of z. The problem of non-response is often ad- 
dressed by weighting or by some conditioning argument. 
This paper considers both approaches to address the issue of 
unlinked records. 

There is a natural trade-off between the number of 
unlinked records and incorrect links (and consequently the 
bias that they introduce). Consider the case where File X is a 
subsample of File Y so that $U_{xy} = U_x$. Linking all records 
on File X will result, by definition, in no unlinked records 
but will result in the number of incorrect links being 
maximised. If instead we decide to only form links which 
we are very confident are correct, the number of incorrect 
links will decrease but the number of unlinked records will 
increase. In practice, finding the optimal balance between 
the biases due to unlinked records and incorrect links 
depends upon the analysis to be undertaken, the linkage 
methodology, and their interaction. For an in-depth practical 
discussion of this issue see Bishop (2009). 

It is worthwhile mentioning that the problem of making 
inference in the presence of incorrect record linkage is 
similar to the problem of making inference in the presence 
of misclassification of the outcome variable, which is a form 
of measurement error (see Fuller 1987). In the latter case, 
identifying assumptions separate the misclassification mech- 
anism from the model mechanism and are required since no 
error-free measurement is typically available. For example, 
Hausman, Abrevaya and Scott-Morton (1998) consider 
misclassification in the outcome variable of a logistic 
regression model. Their identifying assumption is that the 
value of the, possibly misclassified, outcome variable is a 
particular function of the model’s explanatory variables. 
Our proposed method does not require the strong identifying 
assumptions of measurement error problems essentially 
because error-free measurement is available from a clerical 
sample which identifies correct links. The assumptions we 
make in this paper are outlined in section 3. 

Section 2 summarises the ML approach to contingency 
table and regression analysis under perfect linkage. Section 
3 considers the ML approach in the presence of incorrect 
links. Section 4 considers the ML approach in the presence 
of both incorrect links and unlinked records. Section 5 
demonstrates the effectiveness of many of the proposed 
estimators in an empirical study. Section 6 summarises the 
findings. 


2. Perfect linkage 


By way of introducing notation, this section discusses the 
case where the linkage is perfect. The estimating approach 
in this section is standard since, clearly, no special adjust- 
ment for incorrect linkage is required. Section 2.1 discusses 
estimating cell probabilities in a contingency table and 
section 2.2 discusses estimating regression coefficients in a 
logistic regression. 


2.1 Contingency tables 


For notation, it is convenient when considering contingency 
table analysis to transform $x_i$ to a single categorical 
variable $x$ so that $x = 1, \ldots, g, \ldots, G$. Define $y$ to be a 
categorical variable on File Y, $y = 1, \ldots, c, \ldots, C$. 

Consider the following factorisation of the distribution of 
$x$ and $y$: 

$$p(y, x) = p(y \mid x; \Pi)\, p(x)$$

where $\Pi = (\pi_1', \ldots, \pi_g', \ldots, \pi_G')'$, 
$\pi_x = (\pi_{1|x}, \ldots, \pi_{c|x}, \ldots, \pi_{C|x})'$ and 
$\pi_{c|x}$ is the probability that $y = c$ given $x = g$. We assume 
that for every value of $x$ there are $C$ possible values of 
$y$, which implies that the dimension of $\Pi$ is $CG$. 

We now consider maximum likelihood estimation of the 
parameter $\Pi$, characterising $p(y \mid x; \Pi)$, under perfect linkage. 
Perfect linkage means that all records on File X are correctly 
linked to their corresponding record on File Y (i.e., there are 
no incorrect links and no unlinked records). Under perfect 
linkage, $n_{xy} = n_x$ and the set of linked records is denoted 
$d = \{(y_i, x_i): i = 1, \ldots, n_{xy}\}$. Under perfect linkage, the 
score function for $\pi_x = (\pi_{1|x}, \ldots, \pi_{c|x}, \ldots, \pi_{C-1|x})'$, 
characterised by the multinomial distribution, is 

$$\mathrm{Score}(\pi_x; d) = (\mathrm{Score}(\pi_{1|x}; d), \ldots, \mathrm{Score}(\pi_{c|x}; d), \ldots, \mathrm{Score}(\pi_{C-1|x}; d))' \qquad (1)$$

where 

$$\mathrm{Score}(\pi_{c|x}; d) = \sum_i \left( w_{ic|x}\, \pi_{c|x}^{-1} - w_{iC|x}\, \pi_{C|x}^{-1} \right) = n_{c|x}\, \pi_{c|x}^{-1} - n_{C|x}\, \pi_{C|x}^{-1}$$

for $c = 1, \ldots, C-1$, where $n_{c|x} = \sum_i w_{ic|x}$, $w_{ic|x} = 1$ if $y_i = c$ 
and $x_i = x$ and $w_{ic|x} = 0$ otherwise, and the category 
corresponding to $y = C$ is the arbitrarily chosen reference 
category. Solving $\mathrm{Score}(\pi_x; d) = 0_{C-1}$ for $\pi_x$, where $0_{C-1}$ 
is a $C-1$ column vector of zeros, gives the maximum 
likelihood (ML) estimator 

$$\hat\pi_{c|x} = n_{c|x} \Big( \sum_{c=1}^{C} n_{c|x} \Big)^{-1} \qquad (2)$$

where 

$$n_{c|x} = \sum_i w_{ic|x}$$

and 

$$\hat\pi_{C|x} = 1 - \sum_{c=1}^{C-1} \hat\pi_{c|x}.$$
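
As a minimal sketch of estimator (2), the cell probabilities under perfect linkage are simply the observed proportions of each $y$ category within each $x$ category. The function name `ml_cell_probabilities` and the toy linked file below are hypothetical and are not taken from the study.

```python
from collections import Counter

def ml_cell_probabilities(pairs):
    """Return {(c, x): n_{c|x} / sum_c n_{c|x}} for perfectly linked (y, x) pairs."""
    n_cx = Counter(pairs)                    # n_{c|x}: records with y = c and x = g
    n_x = Counter(x for _, x in pairs)       # total number of records in each x category
    return {(c, x): n / n_x[x] for (c, x), n in n_cx.items()}

# Hypothetical perfectly linked file d = {(y_i, x_i)}.
d = [(1, "a"), (1, "a"), (2, "a"), (2, "b"), (2, "b"), (1, "b"), (2, "b")]
pi_hat = ml_cell_probabilities(d)
```

Within each value of $x$ the estimated probabilities sum to one, which mirrors the constraint used to obtain the estimate for the reference category.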


2.2 Logistic regression 


Consider the logistic regression model 

$$E(y_i) = v_i \qquad (3)$$

$$v_i = 1 / [1 + \exp(-\beta' x_i)]. \qquad (4)$$

For (4) the $K$ elements of $x_i$ are dichotomous variables 
and $y_i$ is now a dichotomous variable available from File 
Y. If we define $x = (x_1, \ldots, x_i, \ldots, x_{n_{xy}})'$, 
$y = (y_1, \ldots, y_i, \ldots, y_{n_{xy}})'$ and $v = (v_1, \ldots, v_i, \ldots, v_{n_{xy}})'$, 
the score function for $\beta$ based on perfectly linked data, $d$, is 

$$\mathrm{Score}(\beta; d) = x'(y - v). \qquad (5)$$

Solving $\mathrm{Score}(\beta; d) = 0_K$ for $\beta$ gives the ML 
estimate $\hat\beta$, which can be found by applying the well-known 
Newton-Raphson method. 
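
The following is a sketch of how the score equation (5) might be solved by Newton-Raphson, assuming the usual logistic form $v_i = 1/[1 + \exp(-\beta' x_i)]$ and an information matrix $x'\,\mathrm{diag}\{v_i(1 - v_i)\}\,x$. The helper names `solve` and `fit_logistic`, the fixed iteration count and the toy data are illustrative assumptions.

```python
import math

def solve(a, b):
    """Solve the linear system a z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for k in range(col, n + 1):
                m[r][k] -= f * m[col][k]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):
        z[r] = (m[r][n] - sum(m[r][k] * z[k] for k in range(r + 1, n))) / m[r][r]
    return z

def fit_logistic(x, y, iters=25):
    """Newton-Raphson iterations for Score(beta) = x'(y - v) = 0."""
    k = len(x[0])
    beta = [0.0] * k
    for _ in range(iters):
        v = [1.0 / (1.0 + math.exp(-sum(b * xi for b, xi in zip(beta, row)))) for row in x]
        score = [sum(row[j] * (yi - vi) for row, yi, vi in zip(x, y, v)) for j in range(k)]
        info = [[sum(row[a] * row[b] * vi * (1.0 - vi) for row, vi in zip(x, v))
                 for b in range(k)] for a in range(k)]
        step = solve(info, score)            # Newton step: info^{-1} * score
        beta = [b + s for b, s in zip(beta, step)]
    return beta

# Hypothetical data: an intercept plus one dichotomous covariate.
x = [[1, 0]] * 4 + [[1, 1]] * 4
y = [1, 0, 0, 0, 1, 1, 1, 0]
beta_hat = fit_logistic(x, y)
```

Because the response enters only through $y - v$, the same routine accepts fractional responses, which is convenient when $y_i$ is replaced by a conditional expectation.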


3. Analysis with incorrect links 


This section considers the situation where the linked file 
contains incorrect links but does not contain unlinked 
records. This occurs when all the records on File X are 
linked to a record on File Y (so $n_x \le n_y$). Define the linked 
file of records by $d^* = \{d_i^* = (y_i^*, x_i): i = 1, \ldots, n_x\}$, where 
$y_i^*$ is the value of $y$ that is linked to record $i$ on File X. To 
clarify, $y_i$ is the true value of $y$ for record $i$ on File X, so 
that $y_i^* = y_i$ if record $i$ is correctly linked. 

The estimator given by (2), together with the assumption 
that $y_i^* = y_i$ for $i = 1, \ldots, n_x$, is naive since it treats the 
probabilistically linked file as if it were perfectly linked. In 
general the naive estimator will be biased. This section 
derives ML estimators which account for the fact that the 
data have been linked probabilistically or linked imperfectly 
in some way. 

It is common practice to select a subsample of the linked 
file, denoted by $s_c$, which is then reviewed clerically. The 
clerical review classifies a link, $d_i^*$, as either correct or 
incorrect. Let $\delta_i = 1$ if record $i$ on File X is correctly linked 
and $\delta_i = 0$ otherwise. 

Designing the clerical subsample is an important problem, 
especially since clerical review is often a costly exercise. 
Possible uses of a clerical sample include estimating 
the proportion of correctly linked and unlinked records, 
assisting in deciding which records should be linked and which 
should remain unlinked, ensuring correct inference using 
$d^*$ (i.e., the purpose of this paper), and identifying 
improvements to the way in which records are linked (in the ABS 
applications mentioned above, clerical samples were designed 
to ensure that each link had at least a specific 
probability of being correct). For the purpose of making 
correct inference using $d^*$, selecting the clerical sample by 
simple random sampling is a reasonable approach. A more 
efficient clerical subsample could possibly be devised but 
there is no obvious way to do so. This is because the para- 
meters that we need to estimate to implement the ML 
method described in this paper depend upon the specific 
analysis (e.g., choice of y and x). Designing a clerical 
sample for all possible analyses would be difficult. 

We factorise the joint distribution $p(y_i, x_i, \delta_i)$ by 

$$p(y_i, x_i, \delta_i) = p(y_i \mid x_i; \theta)\, p(x_i)\, p(\delta_i \mid x_i), \qquad (6)$$

where $\theta = \beta$ in the regression case and $\theta = \Pi$ in the 
contingency table case. Factorisation (6) means that the links are 
incorrect at random (IAR) or, in other words, that the 
distributions $y_i \mid x_i$ and $\delta_i \mid x_i$ are independent. Under this 
assumption it is only necessary to maximise the likelihood 
associated with the factor $p(y_i \mid x_i; \theta)$. Throughout this 
section we assume (6). It is important to point out that (6), 
and the development that follows, makes no assumption 
requiring File X to be a subset of File Y (e.g., when units on 
File X are a subsample of the units on File Y) or that the 
linkage process involves a single pass. We also assume that 
the correctness of linkage, $\delta_i$, is independent from record to 
record. 

As mentioned in the introduction, each linked record is 
assigned a score based on the probability that the records 
belong to the same unit. Denote the score by $r_i$. A referee 
suggested using $r_i$ to more accurately parameterise the 
distribution of $\delta_i$. Technically this suggestion would 
involve replacing $p(\delta_i \mid x_i)$ with $p(\delta_i \mid x_i, r_i)$ in (6) and 
would likely reduce the variability of the ML estimators 
discussed in section 3. This would be a useful avenue of 
further research. 


3.1 Contingency tables 


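Define $w_{ic|x}^* = 1$ if $y_i^* = c$ and $x_i = x$, and $w_{ic|x}^* = 0$ 
otherwise. The expectation of $w_{ic|x}$ given $d^*$ is 

$$E(w_{ic|x} \mid x_i = x, y_i^* = y) = \begin{cases} w_{ic|x}^*\, p_{xy} + (1 - p_{xy})\, \pi_{c|x} & \text{if } i \notin s_c \\ w_{ic|x}^* & \text{if } i \in s_c \text{ and } \delta_i = 1 \\ \pi_{c|x} & \text{if } i \in s_c \text{ and } \delta_i = 0 \end{cases}$$

and $p_{xy}$ is the probability that the $i$th link is correct given 
$x_i = x$ and $y_i^* = y$. The ML estimator of $\pi_{c|x}$ using the 
probabilistically linked data, $d^*$, is then 

$$\hat\pi_{c|x} = \tilde n_{c|x} \Big( \sum_{c} \tilde n_{c|x} \Big)^{-1} \qquad (7)$$

where 

$$\tilde n_{c|x} = \sum_i \tilde w_{ic|x}, \qquad (8)$$

$$\tilde w_{ic|x} = \begin{cases} w_{ic|x}^*\, \hat p_{xy} + (1 - \hat p_{xy})\, \hat\pi_{c|x} & \text{if } i \notin s_c \\ w_{ic|x}^* & \text{if } i \in s_c \text{ and } \delta_i = 1 \\ \hat\pi_{c|x} & \text{if } i \in s_c \text{ and } \delta_i = 0 \end{cases} \qquad (9)$$

and $\hat p_{xy}$ is the proportion of correct links in the clerical 
sample among links with $x_i = x$ and $y_i^* = y$, 

$$\hat p_{xy} = \sum_{i \in s_c} \delta_i\, w_{iy|x}^* \Big( \sum_{i \in s_c} w_{iy|x}^* \Big)^{-1}. \qquad (10)$$

The estimation procedure involves iterating between (7), 
(8) and (9) until convergence. Specifically the algorithm is: 

1. Calculate $\hat p_{xy}$ from (10). 

2. Initialise $\hat\pi_{c|x}^{(0)}$ and then calculate $\tilde w_{ic|x}^{(0)}$ from (9) and 
then $\tilde n_{c|x}^{(0)}$ from (8). 

3. Calculate $\hat\pi_{c|x}^{(t)}$ from (7) using $\tilde n_{c|x}^{(t-1)}$. 

4. Calculate $\tilde w_{ic|x}^{(t)}$ from (9) using $\hat\pi_{c|x}^{(t)}$ and then calculate 
$\tilde n_{c|x}^{(t)}$ from (8) using $\tilde w_{ic|x}^{(t)}$. 

5. Iterate between 3 and 4 until convergence. 

The initialised value $\hat\pi_{c|x}^{(0)}$ could be set to the naive 
estimate of $\pi_{c|x}$, which was described in section 3 above. 
However, our experience was that the choice of initial value was 
not important. 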
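
The iteration between (7), (8) and (9) can be sketched as follows, assuming the probabilities of a correct link have already been estimated from the clerical sample. The record layout `(y_star, x, status)`, the function name and the toy probabilities are hypothetical choices made only for illustration.

```python
def ml_with_incorrect_links(records, p_hat, categories, tol=1e-10, max_iter=500):
    """Iterate (7)-(9) for contingency table cell probabilities.

    records: list of (y_star, x, status); status is None when the link was not
             clerically reviewed, True when reviewed and correct, False when
             reviewed and incorrect.
    p_hat:   {(x, y_star): estimated probability that such a link is correct}.
    """
    xs = sorted({x for _, x, _ in records})
    # Initialise with the naive estimate, which treats every link as correct.
    pi = {}
    for x in xs:
        n_x = sum(1 for _, xi, _ in records if xi == x)
        for c in categories:
            pi[(c, x)] = sum(1 for y, xi, _ in records if xi == x and y == c) / n_x
    for _ in range(max_iter):
        n_tilde = {key: 0.0 for key in pi}
        for y_star, x, status in records:
            for c in categories:
                w_star = 1.0 if y_star == c else 0.0
                if status is None:                       # not in the clerical sample
                    p = p_hat[(x, y_star)]
                    w = w_star * p + (1.0 - p) * pi[(c, x)]
                elif status:                             # reviewed, link correct
                    w = w_star
                else:                                    # reviewed, link incorrect
                    w = pi[(c, x)]
                n_tilde[(c, x)] += w
        new_pi = {}
        for x in xs:
            total = sum(n_tilde[(c, x)] for c in categories)
            for c in categories:
                new_pi[(c, x)] = n_tilde[(c, x)] / total
        change = max(abs(new_pi[k] - pi[k]) for k in pi)
        pi = new_pi
        if change < tol:
            break
    return pi

# Hypothetical linked file: one x category, no clerically reviewed records.
records = [(1, "a", None)] * 6 + [(2, "a", None)] * 4
p_hat = {("a", 1): 0.9, ("a", 2): 0.5}
pi_ml = ml_with_incorrect_links(records, p_hat, categories=[1, 2])
```

In this toy case the naive estimate of 0.6 for category 1 moves to roughly 0.73, because links with $y^* = 2$ are assumed to be less reliable than links with $y^* = 1$.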


3.2 Logistic regression 


Below we describe two ML methods (Methods 1 and 2) 
for estimating $\beta$ using the probabilistically linked data, $d^*$. 
Both methods give unbiased estimates under the IAR 
assumption. The difference between the methods is the level 
of aggregation at which the probabilities of correct linkage 
are estimated. Method 1 requires these probabilities at a fine 
level of aggregation, which may mean its estimates are more 
variable than those of Method 2. 


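3.2.1 Method 1 


The expectation of $y_i$ conditional on the linked data is 

$$E(y_i \mid x_i = x, y_i^* = y) = \begin{cases} y_i^*\, p_{xy} + (1 - p_{xy})\, v_i & \text{if } i \notin s_c \\ y_i^* & \text{if } i \in s_c \text{ and } \delta_i = 1 \\ v_i & \text{if } i \in s_c \text{ and } \delta_i = 0 \end{cases}$$

and $p_{xy}$ is the probability that the $i$th link is correct given 
$x_i = x$ and $y_i^* = y$. 

The ML estimator is then obtained by iterating between 
updating $\tilde y_i$ and finding the solution, denoted by $\hat\beta$, for $\beta$ 
in (5) with $y_i$ replaced by $\tilde y_i$, where 

$$\tilde y_i = \begin{cases} y_i^*\, \hat p_{xy} + (1 - \hat p_{xy})\, \hat v_i & \text{if } i \notin s_c \\ y_i^* & \text{if } i \in s_c \text{ and } \delta_i = 1 \\ \hat v_i & \text{if } i \in s_c \text{ and } \delta_i = 0, \end{cases} \qquad (11)$$

$\hat v_i$ has the same form as $v_i$ except that $\beta$ is replaced with 
$\hat\beta$, and $\hat p_{xy}$ is the estimated proportion of correct links 
in the clerical sample for each combination of $x$ and $y^*$. 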

3.2.2 Method 2 


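Let $x'y$ in (5) have $k$th element 

$$r_k = \sum_{i} r_{ik},$$

where $r_{ik} = y_i x_{ik}$. The expectation of $r_{ik}$ conditional on 
$d^*$ is 

$$E(r_{ik} \mid x_{ik} = x, y_i^* = y) = \begin{cases} [y_i^*\, p_{ky} + (1 - p_{ky})\, v_i]\, x_{ik} & \text{if } i \notin s_c \\ y_i^*\, x_{ik} & \text{if } i \in s_c \text{ and } \delta_i = 1 \\ v_i\, x_{ik} & \text{if } i \in s_c \text{ and } \delta_i = 0 \end{cases} \qquad (12)$$

and $p_{ky}$ is the probability that a link with $x_{ik} = 1$ is correct 
given $y_i^* = y$. The ML estimator is then obtained by 
iterating between updating $\tilde r_{ik}$ and finding the solution, 
denoted by $\hat\beta$, for $\beta$ in (5) with $r_{ik}$ replaced by $\tilde r_{ik}$, where 

$$\tilde r_{ik} = \begin{cases} [y_i^*\, \hat p_{ky} + (1 - \hat p_{ky})\, \hat v_i]\, x_{ik} & \text{if } i \notin s_c \\ y_i^*\, x_{ik} & \text{if } i \in s_c \text{ and } \delta_i = 1 \\ \hat v_i\, x_{ik} & \text{if } i \in s_c \text{ and } \delta_i = 0, \end{cases} \qquad (13)$$

$\hat v_i$ has the same form as $v_i$ except that $\beta$ is replaced with 
$\hat\beta$, and $\hat p_{ky}$ is the estimated proportion of correct links in 
the clerical sample for each combination of $x_k$ and $y^*$. 
Namely, if $y^* = 1$, 

$$\hat p_{k1} = \sum_{i \in s_c} y_i^*\, x_{ik}\, \delta_i \Big/ \sum_{i \in s_c} y_i^*\, x_{ik}$$

and if $y^* = 0$, 

$$\hat p_{k0} = \sum_{i \in s_c} (1 - y_i^*)\, x_{ik}\, \delta_i \Big/ \sum_{i \in s_c} (1 - y_i^*)\, x_{ik}.$$

This approach requires only $2K$ probabilities to be 
calculated from the clerical sample and, on this basis, may 
be preferable to the approach in section 3.2.1 which requires 
more probabilities to be calculated. 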


3.3 Estimating the variance using the bootstrap 


In this section we describe how to calculate the variance 
of the ML estimates of section 3. Denote the parameter of 
interest by $\theta$, introduced earlier, and its ML estimate by $\hat\theta$. 
The Bootstrap (Rubin and Little 2003) estimate of the 
variance of $\hat\theta$, denoted by $\hat v_{boot}(\hat\theta)$, is obtained by 

1. Taking a replicate sample of size $n_x$ from the linked 
file, $d^*$, by simple random sampling with replacement. 
Denote the $r$th replicate sample by $d^*(r)$. The 
$r$th replicate clerical sample is $s_c(r) = s_c \cap d^*(r)$. 

2. Calculating $\hat\theta(r)$, which has the same form as $\hat\theta$ 
except that $d^*(r)$ is used instead of $d^*$ and $s_c(r)$ 
is used instead of $s_c$. 

3. Repeating steps 1 and 2 $R$ times, where $R$ is the 
number of replicates. 

4. Calculating 

$$\hat v_{boot}(\hat\theta) = \frac{1}{R} \sum_{r=1}^{R} (\hat\theta(r) - \hat\theta)(\hat\theta(r) - \hat\theta)'.$$
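
The four bootstrap steps can be sketched as below for a scalar parameter. Each record is assumed to carry its own clerical review status, so that resampling records also resamples the clerical sample. The divisor $R$, the function names and the toy data are assumptions made for illustration (a divisor of $R - 1$ is also common).

```python
import random

def bootstrap_variance(linked, estimator, n_reps=500, seed=1):
    """Bootstrap variance of a scalar estimator computed from the linked file."""
    rng = random.Random(seed)
    n = len(linked)
    theta_hat = estimator(linked)
    total = 0.0
    for _ in range(n_reps):
        # Step 1: simple random sample of size n with replacement.
        replicate = [linked[rng.randrange(n)] for _ in range(n)]
        # Step 2: recompute the estimate on the replicate.
        total += (estimator(replicate) - theta_hat) ** 2
    # Step 4: average squared deviation from the full-sample estimate.
    return total / n_reps

def proportion(records):
    """Hypothetical estimator: share of records with linked value y* = 1."""
    return sum(y for y, _ in records) / len(records)

# Hypothetical linked file of (y_star, review_status) pairs.
linked = [(1, None)] * 30 + [(0, None)] * 70
v_boot = bootstrap_variance(linked, proportion)
```

For this toy proportion the result should be close to the textbook value $0.3 \times 0.7 / 100 = 0.0021$. In practice `estimator` would be one of the ML estimators of section 3.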


4. Analysis with incorrect links and 
unlinked records 


This section discusses two ways of analysing linked data 
in the presence of incorrect links and unlinked records. As 
mentioned in the introduction, the problem of analysis when 
there are unlinked records has clear parallels with the 
problem of unit non-response. Unlinked records may result 
in some characteristics on the linked file being over- or 
under-represented, thus leading to biased analysis. As 
discussed in more detail below, we use the fact that the 
mechanism giving rise to unlinked records can only be a 
function of $z$. 

This section considers two methods of making inference 
in the presence of incorrect links and unlinked records, 
where linked records are indexed by $i = 1, \ldots, n^*$. (Remember 
that the $i$th record on File X is an unlinked record 
if $i \in U_{xy}$ and record $i$ was not linked to any record on File 
Y.) The methods involve independently modelling the 
processes that determine which records are incorrectly 
linked and which are unlinked (see section 5 for an illustra- 
tion). These models require a subsample, denoted by s,,.. of 
all records on File X to be subjected to clerical review. 
Records in the subsample will be either linked to records on 
File Y or not linked. Linked records in the subsample must 
be identified as either correctly or incorrectly linked by the 
clerical review process. A subsample record which is not 
linked must be identified as either unlinked, or otherwise. 
Unlinked means the corresponding record was found on File 
Y but not linked to it, whereas otherwise indicates the 
corresponding record was not found on File Y and therefore 
assumed not to exist. The latter identification is potentially 
much more difficult and time-consuming than the former 
because it assumes some other error-free process is available 
for checking whether links, which were not made, are in fact 
correct. Unlinked records, by their nature, have limited 
information that can be used to identify the correct link, 
even during clerical review. Such a process may not exist, in 
which case adjusting for unlinked records would seem to be 
impossible. However, such a process may involve a clerical 
review of names appearing on the two files to be linked. For 
example, a clerical reviewer may realise that the names 
John O. Smith and Joh O. Smith on two different records 
may in fact be the same name (with an “n” missing in the 
latter case, perhaps due to errors in scanning), whereas the 
automated linking process may treat the two names as 
completely different. The clerical reviewer may then decide 
that the above two records correspond to the same 
individual and so therefore should be linked. (Bishop (2009) 
and Wright (2009) discuss the benefits of clerical review). 

The first method involves conditioning analysis on a 
variable $\zeta_i = \zeta(z_i)$. The variable $\zeta$ is defined so that 
inference, in the presence of unlinked records, is unbiased 
conditional on $\zeta$. The term $\zeta$ is introduced since, in many 
cases, it would be impractical or unnecessary to condition 
on all the information in $z$. It is possible to give $\zeta_i$ a 
non-missing value even when $z_i$ contains missing values. The 
exact form of the function $\zeta(z)$ would need to be justified 
after analysis of the clerical subsample. For example, if persons 
under 20 years of age are under-represented in the linked 
file, $\zeta$ would indicate whether a person is under 20 years of 
age. One approach to analysis is to include $\zeta$ as a covariate 
in the regression model. The method in section 3 would then 
apply directly. However, analysts may like to integrate over 
$\zeta$ so that it does not appear in the logistic model or 
contingency table. Section 4.2 discusses how to do this for 
contingency tables. Section 4.3 discusses a pseudo-likelihood 
approach which assigns weights to the linked records 
that attempt to account for any under- or over-representation 
of certain subpopulations in the linked data. Again, the 
choice of weight would need to be justified after analysis of 
the clerical subsample, which identifies unlinked records. This 
is discussed further in the context of the empirical study. 


4.1 Can we ignore unlinked records? 


Define the variable $\gamma_i = 1$ if record $i$ on File X is 
unlinked and $\gamma_i = 0$ otherwise. Also let $\zeta_i$ be a variable so 
that $\zeta_i = 1, 2, \ldots, h, \ldots, H$, where $H$ is the number of 
categories for $\zeta$. We can ignore the fact that there are unlinked 
records if we are prepared to assume that, conditional on $x_i$, 
the distributions of $y_i$, $\gamma_i$ and $\delta_i$ are independent. 
Technically this assumption leads to the factorisation 

$$p(y_i, x_i, \delta_i, \gamma_i, \zeta_i) = p(y_i \mid x_i; \theta)\, p(\delta_i \mid x_i)\, p(\gamma_i \mid x_i)\, p(x_i, \zeta_i)$$

where again $\theta = \beta$ or $\Pi$. It is worthwhile checking whether 
this assumption is valid from the clerical subsample. If 
the assumption is reasonable, then there is no need to apply 
the methods in sections 4.2 and 4.3 and the methods in 
section 3 will suffice. 

We may not be prepared to make the assumption 
mentioned above. We may however be prepared to assume that, 
conditional on $x_i$ and $\zeta_i$, the distributions of $y_i$, $\gamma_i$ and $\delta_i$ 
are independent. In this case, we say unlinked records are 
not ignorable. Technically this assumption leads to the 
factorisation 

$$p(y_i, x_i, \delta_i, \gamma_i, \zeta_i) = p(y_i \mid x_i, \zeta_i; \Lambda)\, p(\delta_i \mid x_i, \zeta_i)\, p(\gamma_i \mid x_i, \zeta_i)\, p(x_i, \zeta_i)$$

where $\Lambda$ is the parameter for the distribution of $y_i \mid x_i, \zeta_i$. 
If we are interested in $p(y_i \mid x_i; \theta)$ but not $p(y_i \mid x_i, \zeta_i; \Lambda)$, 
one approach is to integrate out (i.e., average over) $\zeta_i$ from 
the latter. 


4.2 Conditional Maximum Likelihood (CML) for 
contingency tables 


First, parameterise the joint distribution of $y_i$, $x_i$ and $\zeta_i$ 
by the multinomial distribution with parameter $\Lambda$. Define 
$\Lambda = (\Pi_1', \ldots, \Pi_h', \ldots, \Pi_H')'$, where 
$\Pi_h = (\pi_{1h}', \ldots, \pi_{gh}', \ldots, \pi_{Gh}')'$, 
$\pi_{gh} = (\pi_{1|gh}, \ldots, \pi_{c|gh}, \ldots, \pi_{C|gh})'$ and $\pi_{c|gh}$ is the probability 
that $y_i = c$ given $x_i = g$ and $\zeta_i = h$. The ML estimator of 
$\Pi = (\pi_{c|x})$ from section 2.1 when linkage errors are not 
ignorable is $\hat\Pi = (\hat\pi_{c|x})$, where 

$$\hat\pi_{c|x} = \sum_{h=1}^{H} \hat\pi_{c|xh}\, \hat\pi_{h|x} \qquad (14)$$

where 

$$\hat\pi_{c|xh} = \tilde n_{c|xh} \Big( \sum_{c} \tilde n_{c|xh} \Big)^{-1}, \qquad (15)$$

$\tilde n_{c|xh} = \sum_{i \in U^*} \tilde w_{ic|xh}$, $\sum_{i \in U^*}$ is the sum over the $n^*$ linked 
records and $\hat\pi_{h|x}$ for $h = 1, \ldots, H$ is the standard estimate of 
the marginal distribution of $\zeta$ given $x$ on File X. Further, if 
$i \notin s_c$, 

$$\tilde w_{ic|xh} = w_{ic|xh}^*\, \hat p_{xyh} + (1 - \hat p_{xyh})\, \hat\pi_{c|xh}, \qquad (16)$$

where $p_{xyh}$ is the probability that the $i$th link is correct given 
$x_i = x$, $\zeta_i = h$ and $y_i^* = y$, $w_{ic|xh}^* = 1$ if $y_i^* = c$, $x_i = x$ and $\zeta_i = h$, 
and $w_{ic|xh}^* = 0$ otherwise. If $i \in s_c$, then 
$\tilde w_{ic|xh} = w_{ic|xh}^*$ if the link is determined to be correct 
and $\tilde w_{ic|xh} = \hat\pi_{c|xh}$ if it is determined to be incorrect. 

The ML estimator $\hat\pi_{c|x}$ is obtained by iterating between 
(14), (15) and (16) until convergence. 
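
The averaging step in (14), which removes the conditioning variable from the final estimate, can be illustrated with a small sketch. The function name and the numbers are hypothetical. The two inputs stand for the conditional cell probabilities within each category $h$ and the distribution of the categories given $x$ on File X.

```python
def integrate_out(pi_c_given_xh, pi_h_given_x):
    """Average the conditional cell probabilities over the categories h, as in (14)."""
    return sum(pi_c_given_xh[h] * pi_h_given_x[h] for h in pi_h_given_x)

# Hypothetical values for one (c, x) cell with H = 2 age categories.
pi_c_given_xh = {"under_20": 0.5, "20_plus": 0.8}     # estimated within each category h
pi_h_given_x = {"under_20": 0.25, "20_plus": 0.75}    # distribution of h given x on File X
pi_c_given_x = integrate_out(pi_c_given_xh, pi_h_given_x)
```

Because the weights come from File X rather than from the linked file, a category that is under-represented among the linked records still contributes in proportion to its true share.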


4.3. Pseudo-Maximum Likelihood (PML) 


This section discusses an alternative to the CML, discussed 
in section 4.2, which is referred to as Pseudo-Maximum 
Likelihood (see Chambers and Skinner 2003). It is 
essentially a weighting approach, which may be easier to 
implement than CML, and relies on the factorisation given 
in section 4.1. It involves solving weighted versions of the 
score functions, $\mathrm{Score}(\pi_x; d) = 0_{C-1}$ and $\mathrm{Score}(\beta; d) = 0_K$, 
for $\pi_x$ and $\beta$ respectively, where a record's weight 
equals the inverse of the probability that the record is 
linked. We denote the probability that record $i$ 
will remain unlinked by $f_i = E(\gamma_i)$, so that the unit 
weights are given by $q_i = (1 - f_i)^{-1}$, where $i = 1, \ldots, n^*$. 
Consequently the PML estimator for $\pi_{c|x}$ is 

$$\hat\pi_{c|x} = \tilde n_{c|x} \Big( \sum_{c} \tilde n_{c|x} \Big)^{-1} \qquad (17)$$

where $\tilde n_{c|x} = \sum_{i \in U^*} q_i\, \tilde w_{ic|x}$. The estimate of $\pi_{c|x}$ is obtained 
by iterating between updating $\tilde w_{ic|x}$, given by (9), and (17) 
until convergence. The PML estimator for $\beta$ is the same as 
the ML estimator but where the estimating equation (5) now 
has unit weights of $q_i$. One possible approach to estimating 
the accuracy of the PML estimates under perfect linkage is 
to use the Bootstrap method as described earlier, but where 
now the weight $q_i$ is introduced. 

To illustrate when unlinked records are not ignorable, 
consider linking a data base with personal employment 
status to another data base with education level. Also as- 
sume that age and sex variables, which are correlated with 
employment and education, are available on one of the data 
bases. After conducting a clerical review, we may find that 
records for males are only half as likely to be linked as 
records for females. This could be because 
males are less likely to provide their personal information, 
which is useful in linkage. Clearly, records for males on the 
linked file need to be given a weight double that for females 
in order for joint analysis of employment status and 
educational level to be unbiased. 
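
A minimal sketch of the weighting idea follows, assuming each linked record has an estimated probability of remaining unlinked and receives the weight equal to one over the complementary probability of being linked. The function names, the probabilities and the toy file are hypothetical.

```python
def pml_weights(f_unlinked):
    """Weight each linked record by the inverse of its probability of being linked."""
    return [1.0 / (1.0 - f) for f in f_unlinked]

def weighted_proportion(values, weights):
    """Weighted share of records with value 1."""
    return sum(w * v for v, w in zip(values, weights)) / sum(weights)

# Hypothetical linked file: 2 males (probability 0.6 of remaining unlinked)
# followed by 4 females (probability 0.2 of remaining unlinked).
employed = [1, 1, 1, 0, 0, 0]
f_unlinked = [0.6, 0.6, 0.2, 0.2, 0.2, 0.2]
q = pml_weights(f_unlinked)
unweighted = sum(employed) / len(employed)
weighted = weighted_proportion(employed, q)
```

Here males receive twice the weight of females, and the weighted employment share rises from 0.5 to 0.625 because the under-linked group is restored to its share of the population.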


5. Empirical study 


A quality study conducted by the Australian Bureau of 
Statistics involved linking the 2006 Census of Population 
and Housing to its Dress Rehearsal. The Census Dress 
Rehearsal (CDR) collected information from 78,349 persons and 
was conducted one year before the Census. The 2006 Census 
collected information from more than 19 million people. 

Within a short window, during which the 2006 Census 
data were being processed, name and address were available 
for both the Census and the Census Dress Rehearsal. During 
this time, the two files of person level records were linked 
using two different standards of information: 


Gold Standard (GS) used name, address, mesh block 
and selected Census data items. Mesh block is a geo- 
graphic area typically containing 50 dwellings. All 
names and addresses were destroyed at the end of the 
Census processing period. 

Bronze Standard (BS) used mesh block and selected 
Census data items (i.e., did not use name and address). 
This is a method proposed to be used for future linking 
work by the ABS. 


Full details of the quality study and the linkage metho- 
dology are given in Solon and Bishop (2009). The role of 
GS in the quality study is critical. It provides a benchmark 
against which the reliability of BS can be compared. The 
usefulness of the GS as a benchmark is due to the fact that 
name and address are powerful variables for the purpose of 
identifying common individuals on the Census and CDR 
and that it was subjected to thorough clerical review. As a 
result, GS is assumed to correspond to perfect linkage. 
Accordingly, differences between estimates based on GS 
and BS are interpreted as error. In other words, interest 
focuses on the reliability of BS relative to GS. 


5.1 Linking methodology 
5.1.1 Blocking and linking variables and the 1-1 
assignment algorithm 


This subsection provides an overview of the CDR-to-Census 
linkage methodology for BS. The linking method 
consisted of a sequence of passes, where each pass is 
defined by a set of blocking and linking variables and a 1-1 
assignment algorithm. In the case of multiple passes, only 
records not linked in the first pass are eligible to be linked in 
the second pass, and only records not linked in the second 
pass are eligible to be linked in the third pass, and so on. 

Table 1 gives the blocking variables, denoted by “B”, for 
the BS. For example, during Pass 1, a Census record and a 
CDR record are only considered as a possible link if they 
have the same value for mesh block. 

Linking variables are used to measure the degree of 
agreement between a record pair. A high level of agreement 
suggests that the likelihood of the record pair constituting a 
correct link is high. Table 1 gives the linking variables, 
denoted by “L”, for BS. For example, during Pass 1 of BS, a 
range of variables such as day, month and year of birth, 
country of birth and highest level of qualifications are used 
as linking variables. 


Table 1 
An example of blocking (B) and linking (L) variables used when 
linking 2006 Census data with the Census Dress Rehearsal. 
Different blocking variables were used on each of the two passes 

Variable (columns: Pass 1, Pass 2) 
Day of birth 
Month of birth 
Year of birth 
Sex 
Indigenous status 
Country of birth 
Language spoken 
Year of arrival 
Marital status 
Religious affiliation 
Field of study of highest qualification 
Level of highest qualification 
Highest level of schooling 
Mesh block 

[B and L entries for Pass 1 and Pass 2 not legible in the source] 


An output from each pass is a score for all record pairs. 
The score is a measure of the level of agreement between 
the pair of records. We defer the formal definition of score 
(for details see (3.6), Conn and Bishop 2006) but illustrate 
how it can be interpreted below. Consider BS in Pass 2 
where record pairs have the same full date of birth and sex; 
a record pair would be assigned a score of 23.5 if there is 
agreement on mesh block (+17) and year of arrival (+8) and 
disagreement on religion (—1.5) (in this example agreement 
status for other linking variables would contribute to the 
score but for illustration purposes we ignore them). The 
contribution to the score for agreement on mesh block (+17) 
is greater than that for agreement on year of arrival (+8) 
because the former is less likely to occur by chance alone. 
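
The worked score above is just a sum of field-level contributions, which the following sketch reproduces. Only the three contributions quoted in the text are used. The dictionary layout and function name are hypothetical, and weights for other fields or outcomes are deliberately omitted because they are not given here.

```python
# Contributions quoted in the text for Pass 2 of BS.
AGREEMENT = {"mesh_block": 17.0, "year_of_arrival": 8.0}
DISAGREEMENT = {"religion": -1.5}

def pair_score(agreements):
    """Sum the contributions for a record pair; agreements maps field -> True/False."""
    total = 0.0
    for field, agrees in agreements.items():
        table = AGREEMENT if agrees else DISAGREEMENT
        total += table[field]
    return total

score = pair_score({"mesh_block": True, "year_of_arrival": True, "religion": False})
```

A pair is then kept as a candidate link only if its score exceeds the cut-off for the pass.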

To formalise the aim of the linkage algorithm, denote the 
score for record $i$ on the CDR and record $j$ on the Census 
during pass $p$ of BS by $r_{ijp}$. The set of all record pair 
scores $r_{ijp}$ and the cut-off $f_p$ were used by the linking 
package Febrl (see Christen and Churches 2005) to 
determine the optimal set of links in pass $p$. The term $f_p$ is 
the minimum value for the score in order for a record pair to 
be assigned as a link during pass $p$. The Febrl algorithm 
seeks to maximise $\sum r_{ijp}$ over the assigned links, subject to 
$r_{ijp} > f_p$. Clearly, the number of links depends upon $f_p$. 

In what follows, we evaluate BS with two different sets 
of cut-offs, where a set of cut-offs is defined by the pass 1 
and 2 cut-offs. The first is referred to as the Very Low (VL) 
cut-off and is considered to be the optimal cut-off since, for a 
range of cut-offs, its naive estimates were “closest” to the 
corresponding GS estimates (see Bishop 2009). The second 
cut-off is referred to as Ultra-Low (UL) and effectively 
seeks to maximise the number of linked CDR records. 
Below we refer to the two BS linked files by their cut-offs, 
VL and UL. 


5.1.2 Linking results 


GS linked 70,274 of the 78,349 CDR records. Under the 
assumption that GS corresponds to perfect linkage, there 
were 8,075 individuals with CDR records but no Census 
records. In reality the GS is not perfect. For a discussion of 
this see Bishop (2009). 

VL linked 57,790 CDR records. Of the 70,274 CDR 
records that were linked by GS, 13,784 remained unlinked 
by VL, 700 were linked incorrectly by VL and 55,790 were 
linked correctly by VL. Also, 1,300 CDR records were 
linked by VL but were not linked by GS; these are also 
incorrect links. So in total there were 2,000 (= 700 + 1,300) 
incorrect links. 

UL linked 74,350 CDR records. Of the 70,274 CDR 
records that were linked by GS, 2,811 remained unlinked by 
UL, 9,793 were linked incorrectly by UL and 57,670 were 
linked correctly by UL. Also, 6,887 CDR records were 
linked by UL but were not linked by GS. 

In summary, 97% of the VL links are correct and 20% 
(= 13,784/70,274) of the GS’ CDR records remain unlinked. 
The corresponding figures for UL are 78% and 4% 
(= 2,811/70,274). 
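
The summary percentages can be checked directly from the counts reported above. The sketch below only rearranges those counts. The function name and argument names are illustrative.

```python
def link_quality(correct, incorrect_in_gs, incorrect_not_in_gs, gs_linked):
    """Return (number of links, share of links that are correct, share of GS records unlinked)."""
    linked = correct + incorrect_in_gs + incorrect_not_in_gs
    unlinked = gs_linked - correct - incorrect_in_gs
    return linked, correct / linked, unlinked / gs_linked

GS_LINKED = 70274
vl = link_quality(55790, 700, 1300, GS_LINKED)    # Very Low cut-off
ul = link_quality(57670, 9793, 6887, GS_LINKED)   # Ultra-Low cut-off
```

The comparison makes the trade-off explicit: moving from VL to UL cuts the unlinked share from about 20% to about 4%, at the cost of the share of correct links falling from about 97% to about 78%.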


5.1.3 Modelling the probability of a link being 
correct 


All UL and VL links were known to be correct or 
incorrect (e.g., if a UL link is also made by GS then the UL 
link is correct. Otherwise the UL link is incorrect). As a 
result, $p_{xy}$ in section 3.1 was known from GS. However, to 
simulate reality, $p_{xy}$ was estimated from a clerical sample 
of size 1,000 that was selected from the linked files by 
simple random sampling. 


5.1.4 Modelling the probability of a record 
remaining unlinked 


Each CDR record linked by the GS was assigned a vari- 
able which indicated whether the record was unlinked by BS. 
Namely, if the record remained unlinked by BS then the indi- 
cator variable was assigned a ‘1’ otherwise a ‘0’. A logistic 
model was fitted using GS, where the response variable was 
the above indicator variable and the explanatory variables 
were obtained from the CDR. The more than 20 explanatory 
variables that are in the model were selected by standard 
forward-backward model selection. The explanatory vari- 
ables included educational level, language, born overseas, 
Indigenous status, and indicators of missing key variables 
such as mesh block. The resulting predicted probabilities, $\hat f_i$, 
were used below to implement the Pseudo-ML method 
for both contingency tables and logistic regression. 


5.2 Results of tabular analysis 


Table 2 gives the results of cross-tabulating employment 
status of indigenous people as reported on the CDR and 
Census. Table 2a shows that the GS estimate of the propor- 
tion of indigenous people employed in the Census, given 
they were employed in CDR, is 78.3%. The corresponding 
naive estimate for VL, which assumes the data are perfectly 
linked, is 86.7%. Even after replacing each of the 700 
incorrect VL links by their corresponding correct link and 
discarding the 1,300 linked records for which no correct link 
exists, the naive estimate is largely unchanged at 86.0% 
(referred to as Gold Links in Table 2a). This shows that the 
difference between the VL and GS estimates is not so much 
due to incorrect links but is mainly due to unlinked records. 
This explains in part why the ML estimate (86.4%) for VL 
(see section 3.1), which only corrects for incorrect links, did 
not lead to much improvement. Conditional ML (CML) (see 
section 4) was considered in an attempt to reduce the error 
due to unlinked records that may have led to a 
misrepresentation, with respect to age and sex characteristics, in the 
linked file. The CML employment estimate was 86.6%. 
Unfortunately, CML did not make much of an improvement, 
indicating that the underlying mechanism generating 
unlinked records did not depend upon age and sex. PML 
estimates (see section 4) also did not make much of an 
improvement, indicating that the logistic model described in 
section 5.1.4 did not explain the mechanism generating 
unlinked records. Interestingly, the ML estimate using UL 
was 81.8%, by far the closest estimate to the GS estimate of 
78.3%. The UL’s main source of error is incorrect 
links, the type of linkage error which the ML estimator 
addresses. This indicates that correcting for errors due to 
incorrect links was much more successful than correcting 
for errors due to unlinked records. 

Standard errors of the GS, naive and ML estimates are 
shown in parentheses in Table 2a. For VL and UL, ML 
standard errors are respectively about 25% and 75% larger 
than the corresponding naive standard errors. Also, the ML 
standard errors for UL are slightly smaller than for VL, 
indicating that the extra links made by UL were worthwhile. 
Clearly, naive inference with UL over-states the level of 
confidence in estimates. For VL, naive and ML standard 
errors and estimates are very close. 

Irrespective of the cut-off, the ML estimates in Table 2 
a, b and c are always closer to the GS estimates than the 
corresponding naive estimate. For example in Table 2b the 
ML estimate for VL is 36.9%, noticeably closer to the GS 
estimate of 37.9% than the naive estimate of 33.3%. Based 
on the estimates in Table 2 it could be argued that the choice 
of whether to use VL or UL is not so important, as long as 
the ML estimator is used. 


Table 2 
Percentages of Indigenous persons in various employment categories in 2006 given their employment category in 2005. For each linked 
data set, Very Low and Ultra Low, the estimation methods can be compared with the Gold 

Estimates for different methods and linked data set 

a: Indigenous persons employed in 2005 

                                       Very Low Cut-off                            Ultra Low Cut-off 
Status in 2006           Gold          Naive    Gold links  ML       PML    CML    Naive          ML 
Employed                 78.3          86.7     86.0        86.4     86.6   86.1   [illegible]    81.8 
                         ([illegible]) (2.4)                (3.0)                  ([illegible])  (2.9) 
Unemployed               [illegible]   4.2      4.3         4.1      4.1    4.2    6.3            [illegible] 
                         (0.84)        (1.2)                (2.5)                  (0.82)         (2.1) 
Not in the labour force  17.8          9.0      9.6         9.3      9.1    9.6    21.6           14.7 
                         (1.6)         (2.4)                (3.1)                  (1.6)          (2.8) 

b: Indigenous persons unemployed in 2005 

                                       Very Low                   Ultra Low 
Status in 2006           Gold          Naive        ML            Naive        ML 
Employed                 [illegible]   [illegible]  [illegible]   [illegible]  23.8 
Unemployed               34.4          38.9         36.4          32.3         38.0 
Not in the labour force  37.9          33.3         36.3          32.3         38.0 

c: Indigenous persons not-in-the-labour force in 2005 

                                       Very Low                   Ultra Low 
Status in 2006           Gold          Naive        ML            Naive        ML 
Employed                 [illegible]   10.8         10.7          24.3         10.5 
Unemployed               5.8           7.6          7.4           6.3          5.8 
Not in the labour force  80.4          81.5         81.8          69.2         83.5 

Table 3 is the same as Table 2 except that it describes 
analyses of linked records from all persons 15 and over 
rather than only Indigenous persons. Again the ML always 
makes an improvement for the UL, though this is not the 
case for VL. Table 4 gives the student status in 2006 for 
persons who were students in 2005. Again the ML generally 
makes the estimates closer to the corresponding Gold 
estimate, especially for UL. 


Table 3 
Percentages of all persons aged over 15 in various employment 
categories in 2006 given their employment category in 2005. For 
each linked data set, Very Low and Ultra Low, the estimation 
methods can be compared with the Gold 

Estimates for different methods and linked data set 

                                       Very Low                   Ultra Low 
Status in 2006           Gold          Naive        ML            Naive        ML 

a: Persons employed in 2005 
Employed                 91.8          [illegible]  [illegible]   89.7         92.4 
Unemployed               1.8           [illegible]  1.6           1.9          1.6 
Not in the labour force  6.2           6.1          5.6           8.3          5.8 

b: Persons unemployed in 2005 
Employed                 44.5          44.3         44.0          49.4         43.8 
Unemployed               26.8          26.6         27.5          [illegible]  [illegible] 
Not in the labour force  28.6          [illegible]  [illegible]   [illegible]  [illegible] 

c: Persons not-in-the-labour force in 2005 
Employed                 [illegible]   [illegible]  [illegible]   [illegible]  [illegible] 
Unemployed               [illegible]   [illegible]  [illegible]   [illegible]  [illegible] 
Not in the labour force  [illegible]   [illegible]  [illegible]   [illegible]  [illegible] 


Table 4 
Student outcomes in 2006 for high school students in 2005 

                                            Very Low                   Ultra Low 
Student Status in 2006        Gold          Naive        ML            Naive        ML 
High School Student           [illegible]   [illegible]  [illegible]   [illegible]  [illegible] 
Completed High School         14.0          14.3         13.7          14.7         14.1 
Did not Complete High School  6.6           6.3          6.6           7.8          6.2 


5.3. Simulation 


The following simulation study illustrates the problems 
with naive analysis and the benefit of using the method 
outlined in this paper. Files X and Y in the simulation, each 
containing 2,000 records, are independently generated 400 
times, where each generated file is denoted by X(r) and 
Y(r), and r = 1, ..., 400. Specifically, on X(r), $x_i$ is 
randomly generated from the Bernoulli distribution with 
parameter 0.5. On Y(r), $y_i$ is randomly generated from the 
Bernoulli distribution with parameter $v_i$, where 
$v_i = 1/[1 + \exp(\beta_0 + \beta_1 x_i)]$, $\beta = (\beta_0, \beta_1)'$, 
$\beta_0 = -0.5$ and $\beta_1 = 1.5$. 
The $r$-th set of imperfectly linked data, $d'(r)$, is generated 
by correctly linking each record on File Y(r) to one record 
on File X(r) with probability p = 0.8, 0.90, 0.95 and 1. For 
each $r$-th set of linked data a clerical sample of 300 links is 
selected. Each link in the clerical sample is assigned as 
being correct or incorrect. We summarise the performance 
of the ML estimator from section 3.2.2 and the naive 
method, which assumes there is no linkage error, by their 
95% coverage rates and their Mean Squared Error (MSE). 
The coverage rates are based on the standard errors 
calculated from the Bootstrap described in section 3.3 with R = 
40 replicates. The MSE of $\hat\beta$ is calculated by 

$$\mathrm{MSE}(\hat\beta) = \frac{1}{400} \sum_{r=1}^{400} (\hat\beta_r - \beta)^2,$$

where $\hat\beta_r$ is the ML estimate of $\beta$ from $d'(r)$. 

Table 5 shows that the naive approach has poor coverage 
rates, due to its significant bias in the presence of linkage 
error, and consequently a relatively high MSE. The coverage 
rates for ML-Method 1 are very close to their nominal 
levels. The results show that, as the percentage of correct 
links reduces from 100% to 80%, the MSE of ML increases 
by a factor of about 3 for $\beta_0$ and $\beta_1$. (The coverage rates 
and MSE of ML Methods 1 and 2 were very similar so only 
the former are reported.) 


Table 5 
Mean squared error and coverage rates for linked simulated 
data, where correct linkage occurs with probability, p 

                      Mean Squared Error                        95% Coverage Rates 
              p =     0.8      0.9      0.95     1              0.8     0.9     0.95 
Naive         β0      0.024    0.010    0.0056   0.0043*        0.35    0.80    0.93 
              β1      0.11     0.038    0.016    0.011*         0.05    0.62    0.88 
ML-Method 1   β0      0.013    0.0078   0.0055   0.0043*        93.0    94.25   93.5 
              β1      0.031    0.018    0.013    0.011*         96.0    94.5    96.25 

*when p = 1 the naive and ML estimators are the same by 
definition. 
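
As a rough illustration of why the naive analysis is biased under the simulation design of section 5.3, the following Python sketch generates the two files, links each record on Y correctly with probability p, and computes the naive logistic estimates. It does not implement the ML estimator of section 3.2.2. Several details are assumptions rather than the paper's procedure: an incorrectly linked record is paired with a uniformly chosen different record on X (so the linkage is not forced to be one-to-one), 200 replicates are used, and the function names and seed are illustrative. Because x is binary, the naive maximum likelihood estimates have a closed form in terms of the two group means of y.

```python
import numpy as np

def logit(m):
    """Log-odds of a proportion m in (0, 1)."""
    return np.log(m / (1.0 - m))

def simulate_naive(p, n=2000, reps=200, beta0=-0.5, beta1=1.5, seed=0):
    """Average naive estimates (beta0_hat, beta1_hat) over `reps` linked files.

    Model as written in section 5.3: P(y = 1 | x) = 1 / [1 + exp(beta0 + beta1 * x)].
    """
    rng = np.random.default_rng(seed)
    estimates = np.empty((reps, 2))
    ids = np.arange(n)
    for r in range(reps):
        x = rng.binomial(1, 0.5, size=n)              # File X
        v = 1.0 / (1.0 + np.exp(beta0 + beta1 * x))
        y = rng.binomial(1, v)                        # File Y (true pairs share an index)

        # Assumption: a wrong link points to a uniformly chosen *different* X record.
        correct = rng.random(n) < p
        offset = rng.integers(1, n, size=n)           # values in 1..n-1
        link = np.where(correct, ids, (ids + offset) % n)
        x_linked = x[link]

        # Naive ML for a binary covariate: invert the model at x = 0 and x = 1.
        m0 = y[x_linked == 0].mean()
        m1 = y[x_linked == 1].mean()
        b0 = -logit(m0)
        b1 = -(logit(m1) - logit(m0))
        estimates[r] = (b0, b1)
    return estimates.mean(axis=0)

naive_perfect = simulate_naive(p=1.0)   # no linkage error
naive_p80 = simulate_naive(p=0.8)       # 20% of links incorrect
print("p = 1.0:", naive_perfect)
print("p = 0.8:", naive_p80)
```

Under these assumptions the average naive slope at p = 0.8 is expected to be attenuated towards zero relative to the true value of 1.5, because incorrectly linked pairs carry no association between x and y.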


6. Discussion 


Data linkage is an appropriate technique when data sets 
must be joined to enhance dimensions such as time and 
breadth or depth of detail. Data linkage is increasingly being 
used by statistical organisations around the world. It is well- 
known that errors can arise when linking files, for example 
when applying probabilistic linking methods. However, 
there has been little work reported in the literature about 
how to make valid inferences in the presence of such errors. 
This paper provides methodological and practical advice to 
support analysts in this area. 

In general, naively treating a linked file as if it were 
perfectly linked will lead to biased estimates. The analyst 
should only use the naive approach when both the number 
of unlinked records, defined as records that could be cor- 
rectly linked but were not linked at all, and the number of 
incorrect links are negligible. This paper has presented a 
maximum likelihood approach to making valid inferences in 
the presence of both sources of error. The approach uses the 
well-known EM algorithm and is easy to apply in practice. 
The method can be applied when one of the files is not 
necessarily a subset of the other and when the linkage 
involves multiple passes. These situations often arise in 
practice, including many recent examples in the Australian 
Bureau of Statistics. The empirical study shows that the ML 
approach makes significant and meaningful improvements 
to the estimates from the linked data. 

In the special case where File X is obtained by taking a 
random sample from File Y, the estimation procedure 
described is not ‘full’ maximum likelihood. This is because 
it does not use the fact that population totals for File Y are 
known. While inferences using the method described here 
are still valid in this case, they could perhaps be made more 
efficient (see Scott and Wild 1997). 


Hierarchical Bayes small area estimation 
under a spatial model with application to health survey data 


Yong You and Qian M. Zhou 


Abstract 


In this paper we study small area estimation using area level models. We first consider the Fay-Herriot model (Fay and 
Herriot 1979) for the case of smoothed known sampling variances and the You-Chapman model (You and Chapman 2006) 
for the case of sampling variance modeling. Then we consider hierarchical Bayes (HB) spatial models that extend the Fay- 
Herriot and You-Chapman models by capturing both the geographically unstructured heterogeneity and spatial correlation 
effects among areas for local smoothing. The proposed models are implemented using the Gibbs sampling method for fully 
Bayesian inference. We apply the proposed models to the analysis of health survey data and make comparisons among the 
HB model-based estimates and direct design-based estimates. Our results have shown that the HB model-based estimates 
perform much better than the direct estimates. In addition, the proposed area level spatial models achieve smaller CVs than 
the Fay-Herriot and You-Chapman models, particularly for the areas with three or more neighbouring areas. Bayesian 
model comparison and model fit analysis are also presented. 


Key Words: Area level model; Bayesian model comparison; Disease rate; Gibbs sampling; Hierarchical spatial model; 
Posterior predictive model checking; Sampling variance. 


1. Introduction 


Model-based small area estimation methods have been 
widely used in practice due to the increasing demand for 
precise estimates for local regions and various small areas. 
In general sample surveys are designed to provide reliable 
estimates for large regions or aggregates of small areas such 
as the whole nation and provinces. Direct survey estimates, 
based only on the area specific sample data, usually provide 
reliable estimates of the parameter of interest for those large 
areas. For small areas, particularly some small geographical 
areas or specific small domains, direct estimates are likely to 
yield large standard errors because of the small sample sizes 
in those small areas. Therefore in making inference for 
small areas, it is necessary to borrow strength from related 
areas to form indirect estimates that increase the effective 
sample size and thus increase the precision of estimates. It is 
now generally accepted that the indirect estimates should be 
based on explicit models that provide links to related areas 
through the use of supplementary data such as census counts 
or administrative records; see, for example, Rao (2003) and 
Jiang and Lahiri (2006) for more discussion on model-based 
small area methods. The model-based estimates are 
obtained to improve the direct design-based estimates in terms 
of precision and reliability, i.e., smaller coefficients of 
variation (CVs). There are two broad classifications for 
small area models: area level models and unit level models. 
Area level models are based on area direct survey estimates 
and unit level models are based on individual observations 
in small areas. In this paper we focus on area level models 
that borrow strength across regions to improve the direct 
survey estimates. 

Among the area level models, the Fay-Herriot model 
(Fay and Herriot 1979) is a basic and widely used area level 
model in practice to obtain reliable model-based estimates 
for small areas. The Fay-Herriot model basically has two 
components, namely, a sampling model for the direct 
estimates and a linking model for the parameters of interest. 
The sampling model involves the direct survey estimate and 
the corresponding sampling variance. The Fay-Herriot model 
assumes that the sampling variance is known in the model. 
Typically a smoothed estimator of the sampling variance is 
obtained and then treated as known in the model. Wang and 
Fuller (2003) and You and Chapman (2006) considered the 
situation where the sampling variances are unknown and 
modeled separately by direct estimators. In this paper we 
will consider both the smoothing and modeling methods for 
the sampling variances in the sampling model. 

The linking model relates the parameter of interest to a 
regression model with area-specific random effects. In the 
Fay-Herriot model, the area random effects are usually 
assumed to be independent and identically distributed (iid) 
normal random variables to capture geographically unstruc- 
tured variations among areas. However, in some small area 
applications, particularly in public health estimation prob- 
lems, geographical variation of a disease is a subject of 
interest, and estimation of overall spatial pattern of risk and 
borrowing strength across regions to reduce variances of 
final estimates are both important. Thus, it may be more 
reasonable to construct spatial models on the area-specific 
random effects to capture the spatial dependence among 
them. The spatial models are generally used in health related 
small area estimation, and various spatial models have been 
proposed for small area estimation (e.g., Cressie 1990; 
Ghosh, Natarajan, Stroud and Carling 1998; Maiti 1998; 
Ghosh, Natarajan, Walter and Kim 1999; He and Sun 2000; 
Moura and Migon 2002; Singh, Shukla and Kundu 2005; 
Souza, Moura and Migon 2009). Best, Richardson and 
Thomson (2005) provided a comprehensive review on spa- 
tial models for disease mapping. Rao (2003) also discussed 
several spatial small area models. 

The objective of this paper is to consider spatial correla- 
tion small area models and illustrate the usefulness of these 
models through an application to health survey data. The 
paper is organized as follows. In section 2, we first study 
area level models including the Fay-Herriot model and 
spatial correlation linking models. Then in section 3 we 
propose hierarchical Bayes (HB) small area models with 
spatial correlation and obtain HB inference for small area 
parameters through the Gibbs sampling method. In section 
4, we apply the proposed models to the analysis of small 
area data from the Canadian Community Health Survey. We 
compare the performance of the model-based estimates with 
the direct design-based estimates, and moreover, we 
compare the proposed models with the Fay-Herriot model 
and the You-Chapman model (You and Chapman 2006) to 
investigate the effects of incorporating spatial structure on 
the area-specific random effects. Bayesian model compari- 
son and model fit analysis are also provided. Finally in 
section 5, we offer some concluding remarks. 


2. Small area models and inference 


2.1 Fay-Herriot model 


Let $\theta_i$ denote the parameter of interest for the $i$-th area, 
where $i = 1, \ldots, m$, and $m$ is the total number of areas. 
The Fay-Herriot model assumes that the $\theta_i$'s are related to 
area specific auxiliary data $x_i = (x_{i1}, \ldots, x_{ip})'$ through a 
linear regression model as follows: 

$$\theta_i = x_i'\beta + v_i, \quad i = 1, \ldots, m, \qquad (1)$$

where $\beta = (\beta_1, \ldots, \beta_p)'$ is the $p \times 1$ vector of regression 
coefficients, and the $v_i$'s are area-specific random effects 
assumed to be iid with $E(v_i) = 0$ and $\mathrm{Var}(v_i) = \sigma_v^2$. The 
assumption of normality may also be included. This model 
is referred to as a linking model for $\theta_i$. The Fay-Herriot 
model also assumes that a direct survey estimator $y_i$, which 
is usually design-unbiased for the parameter of interest $\theta_i$, 
is available whenever the area sample size $n_i > 1$. It is 
customary to assume that 

$$y_i = \theta_i + e_i, \quad i = 1, \ldots, m, \qquad (2)$$


where the $e_i$'s are the sampling errors associated with the direct 
estimator $y_i$. We also assume that the $e_i$'s are independent 
normal random variables with mean $E(e_i \mid \theta_i) = 0$ and 
sampling variance $\mathrm{Var}(e_i \mid \theta_i) = \sigma_i^2$. The model (2) is 
referred to as a sampling model for the direct survey estimator 
$y_i$. Combining these two components (1) and (2) leads to a 
linear mixed effects model (the Fay-Herriot model) as 

$$y_i = x_i'\beta + v_i + e_i, \quad i = 1, \ldots, m. \qquad (3)$$


In the basic Fay-Herriot model (3), the sampling variances 
$\sigma_i^2$ are usually assumed as known, which is a very strong 
assumption. Generally, we can use direct sampling variance 
estimates from the survey data; however, these direct 
estimates are unstable if sample sizes are small. Therefore, in 
practice, a smoothed estimator of $\sigma_i^2$ is used in the model 
and treated as known. A generalized variance function is 
usually applied in practice to obtain a smoothed estimator 
for the sampling variance, e.g., Dick (1995). In recent years, 
a method of smoothing design effects has been developed 
and used in practice to obtain smoothed variance estimators 
(e.g., Singh, Folsom and Vaish 2005; You 2008a; Liu, 
Lahiri and Kalton 2008). In particular, You (2008a) applied 
an equal design effects modeling approach to obtain smooth 
estimates of sampling variances. The design effect for the 
$i$-th area may be approximately written as 

$$\mathrm{deff}_i = \frac{s_i^2}{s_{ri}^2}, \quad i = 1, \ldots, m,$$

where $s_i^2$ is the unbiased direct estimate of sampling 
variance based on the complex sampling design, and $s_{ri}^2$ is 
the estimate of sampling variance based on the assumption 
of simple random sampling design. For each area, based on 
the assumption of a common design effect, a smoothed 
factor deff can be obtained by $\mathrm{deff} = \sum_{i=1}^{m} \mathrm{deff}_i / m$. Then a 
smoothed sampling variance estimate $\tilde\sigma_i^2$ can be obtained 
as $\tilde\sigma_i^2 = s_{ri}^2 \cdot \mathrm{deff}$. 
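
The equal design effects smoothing described above amounts to a few array operations. The Python sketch below is only illustrative: the function name and the three input variances are made up, and it assumes the smoothed variance is the simple-random-sampling variance estimate multiplied by the average design effect, as in the text.

```python
import numpy as np

def smooth_sampling_variances(s2_direct, s2_srs):
    """Return (common deff, smoothed variances) from area-level variance estimates.

    s2_direct : direct variance estimates under the complex design
    s2_srs    : variance estimates under an assumed simple random sampling design
    """
    s2_direct = np.asarray(s2_direct, dtype=float)
    s2_srs = np.asarray(s2_srs, dtype=float)
    deff_i = s2_direct / s2_srs        # area-specific design effects
    deff = deff_i.mean()               # common (smoothed) design effect
    return deff, s2_srs * deff         # smoothed sampling variances

# Hypothetical variance estimates for three areas.
deff, smoothed = smooth_sampling_variances([4.0, 9.0, 2.5], [2.0, 6.0, 2.0])
print(deff, smoothed)
```

Averaging the area-level design effects pools information across areas, so an unstable direct variance estimate in one small area has limited influence on its own smoothed variance.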

Instead of plugging in the smoothed estimates of 
sampling variances in the model, alternatively we can model 
the sampling variance directly. Wang and 
Fuller (2003) and You and Chapman (2006) assume that 
the sampling variance $\sigma_i^2$ is unknown and estimate $\sigma_i^2$ by an 
unbiased direct estimator $s_i^2$, which is independent of the 
direct survey estimator $y_i$. They also assume that 
$d_i s_i^2 \sim \sigma_i^2 \chi^2_{d_i}$, where $d_i = n_i - 1$, and $n_i$ is the sample size for the 
$i$-th area. You and Chapman (2006) considered the full HB 
approach with the Gibbs sampling method which 
automatically takes into account the extra uncertainty associated 
with the estimation of $\sigma_i^2$. In this paper, we consider both 
the smoothing and modeling approaches for the sampling 
variances. 


2.2 Spatial models 


To incorporate spatially correlated random effects in the 
linking model, a simple and obvious way is to add a spatial 
random effect $u_i$ in the independent linking model (1) as 
follows: 

$$\theta_i = x_i'\beta + v_i + u_i, \qquad (4)$$

where the $u_i$'s follow the well known intrinsic conditional 
autoregressive model given as 

$$u_i \mid u_{-i} \sim N\left( \frac{\sum_{j \ne i} w_{ij} u_j}{\sum_{j \ne i} w_{ij}}, \; \frac{\sigma_u^2}{\sum_{j \ne i} w_{ij}} \right), \qquad (5)$$

where $u_{-i}$ denotes the values of spatial random effects $u_j$ 
in all other areas with $j \ne i$, the weights $w_{ij}$ are fixed 
constants, and $\sigma_u^2$ is an unknown variance component. In 
practice, a common choice of $w_{ij}$ is to let $w_{ij} = 0$ unless 
areas $i$ and $j$ are neighboring areas (i.e., share a common 
boundary), in which case $w_{ij} = 1$. The model (4) is 
proposed by Besag, York and Mollie (1991) to separate 
spatial effects from overall heterogeneity in the areas. In 
model (4), independent random effects $v_i$ capture 
geographically unstructured heterogeneity among areas, and 
spatial random effects $u_i$ capture spatial dependence 
between areas. In this way, the degree of overall spatial 
dependence can be expressed based on the proportion of the total 
variation in $v_i + u_i$ captured by each component. 

In practice, it is often unclear how to choose between an 
unstructured model (e.g., the basic linking model) given by 
(1) and a purely spatially structured model (e.g., intrinsic 
autoregressive model) given by (5). For model (4), posterior 
inference about the spatial dependence is based on the 
proportion of the total variation in the sum $v_i + u_i$ 
captured by each component. However, although the 
univariate conditional distributions of the spatial component 
(5) are well defined, the corresponding joint distribution is 
improper (with undefined mean and infinite variance). 
Moreover, the model (4) has a potential identifiability 
problem where only the sum of the random effects $v_i + u_i$ 
is well identified by the data; see, for example, Best et al. 
(2005), for a more detailed discussion. 

Alternatively, we can consider another spatial 
parameterization studied by Leroux, Lei, and Breslow (1999) 
and MacNab (2003), which avoids the identifiability 
problem encountered with the model (4). Let $\theta_i = x_i'\beta + b_i$ 
and $b = (b_1, \ldots, b_m)'$. Following Leroux et al. (1999) and 
MacNab (2003), we place the following conditional 
autoregressive (CAR) model on the area specific spatial effects 
$b_i$'s (or $b$): 

$$b \sim \mathrm{MVN}(0, \Sigma(\sigma_v^2, \lambda)) \qquad (6)$$

$$\Sigma(\sigma_v^2, \lambda) = \sigma_v^2 D^{-1} = \sigma_v^2 [\lambda R + (1 - \lambda) I]^{-1} \qquad (7)$$

where $\sigma_v^2$ is a spatial dispersion parameter and $\lambda$ is a 
spatial autocorrelation parameter, $0 < \lambda < 1$; $I$ is an 
identity matrix of dimension $m$; $R$, commonly known as 
the neighbourhood matrix, has $i$-th diagonal element equal 
to the number of neighbors of the area $i$, and the 
off-diagonal elements in each row equal to -1 if the corresponding 
areas are neighbors and 0 otherwise. The CAR model (6) - 
(7) corresponds to the following conditional distribution 
of $b_i$: 

$$b_i \mid b_{-i} \sim N\left( \frac{\lambda \sum_{j \ne i} w_{ij} b_j}{1 - \lambda + \lambda w_{i+}}, \; \frac{\sigma_v^2}{1 - \lambda + \lambda w_{i+}} \right),$$

where $w_{i+} = \sum_{j \ne i} w_{ij}$. The CAR model (6) - (7) becomes 
the intrinsic autoregressive model (5) if $\lambda = 1$. On the 
other hand, if $\lambda = 0$, the CAR model (6) - (7) reduces to 
the independent linking model (1) which assumes 
independence on the area-specific random effects $v_i$. It is 
necessary to point out that the conditional mean and 
variance of $b_i \mid b_{-i}$ are weighted sums of the corresponding 
overall smoothing moments from the basic linking model 
(1) and local smoothing moments from the intrinsic 
autoregressive model: 

$$E(b_i \mid b_{-i}) = \frac{1 - \lambda}{1 - \lambda + \lambda w_{i+}} \times 0 + \frac{\lambda w_{i+}}{1 - \lambda + \lambda w_{i+}} \left( \frac{\sum_{j \ne i} w_{ij} b_j}{w_{i+}} \right),$$

$$\mathrm{Var}(b_i \mid b_{-i}) = \frac{1 - \lambda}{1 - \lambda + \lambda w_{i+}} \, \sigma_v^2 + \frac{\lambda w_{i+}}{1 - \lambda + \lambda w_{i+}} \left( \sigma_v^2 / w_{i+} \right).$$

Thus model (6)-(7) is a balance between the independent 
linking model (1) and the intrinsic CAR model (5). The 
spatial correlation parameter $\lambda$ measures the extent of the 
spatial effects for local smoothing of the neighbouring areas. 
The modeling structure (6) captures both the unstructured 
heterogeneity among areas and the spatial correlation effects 
of the neighbouring area. 
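
To make the structure of the matrix in (7) concrete, the Python sketch below builds the neighbourhood matrix R from a 0/1 adjacency matrix, forms the matrix λR + (1 - λ)I, and evaluates the conditional mean and variance of one spatial effect given the others. The four-area chain map, the effect values and the function names are hypothetical, and the conditional moments are written in the standard form implied by a multivariate normal with precision matrix proportional to λR + (1 - λ)I.

```python
import numpy as np

def neighbourhood_matrix(W):
    """R: diagonal = number of neighbours, off-diagonal = -1 for neighbouring areas."""
    W = np.asarray(W, dtype=float)
    return np.diag(W.sum(axis=1)) - W

def car_precision(W, lam):
    """D = lam * R + (1 - lam) * I, the matrix inverted in equation (7)."""
    W = np.asarray(W, dtype=float)
    return lam * neighbourhood_matrix(W) + (1.0 - lam) * np.eye(W.shape[0])

def conditional_moments(W, b, lam, sigma2, i):
    """Mean and variance of b_i given all other spatial effects."""
    W = np.asarray(W, dtype=float)
    w_plus = W[i].sum()                      # number of neighbours of area i
    denom = 1.0 - lam + lam * w_plus
    mean = lam * (W[i] @ b) / denom          # shrinks the neighbour average towards 0
    var = sigma2 / denom
    return mean, var

# Hypothetical map: four areas in a chain 0 - 1 - 2 - 3.
W = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
b = np.array([0.2, -0.1, 0.4, 0.0])

mean, var = conditional_moments(W, b, lam=0.6, sigma2=2.0, i=1)
print(mean, var)
```

Setting `lam=0.0` gives the identity matrix (independent effects), while values of `lam` near 1 move the conditional mean towards the plain average of the neighbouring effects.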


2.3. Hierarchical Bayes models and inference 


In order to estimate $\theta_i$, the parameter of interest, we 
apply a hierarchical Bayes (HB) approach using the Gibbs 
sampling method. Compared to other approaches such as 
EBLUP and empirical Bayes (EB), the HB approach is 
straightforward and the inferences for $\theta_i$ are exact, unlike the EB or 
EBLUP. Moreover, the HB approach can deal with complex 
small area models using the Markov chain Monte Carlo 
(MCMC) method, which overcomes the computational 
difficulties of multi-dimensional integrations of posterior 
quantities to a large extent. 

Let $y = (y_1, \ldots, y_m)'$, $\theta = (\theta_1, \ldots, \theta_m)'$, and 
$X = (x_1, \ldots, x_m)'$. We first construct two HB models without and 
with spatial structure under the assumption that the 
sampling variances $\sigma_i^2$ are known and replaced by 
the smoothed estimates $\tilde\sigma_i^2$. 


Model 1: Fay-Herriot model, denoted as FHM (Fay and 
Herriot 1979; Rao 2003). 
- $y_i \mid \theta_i \overset{ind}{\sim} N(\theta_i, \sigma_i^2 = \tilde\sigma_i^2)$, $i = 1, \ldots, m$; 

- $\theta_i \mid \beta, \sigma_v^2 \overset{ind}{\sim} N(x_i'\beta, \sigma_v^2)$, $i = 1, \ldots, m$; 

- Priors for the parameters $(\beta, \sigma_v^2)$: $\pi(\beta) \propto 1$; $\pi(\sigma_v^2) \sim \mathrm{IG}(a_0, b_0)$, 
where $a_0, b_0$ are chosen to be very small 
known constants to reflect vague knowledge on $\sigma_v^2$. N 
stands for the normal distribution and IG for the inverse 
gamma distribution. 


Model 2: Proposed area level CAR model, as an extension 
of the Fay-Herriot model, denoted as CAR-FHM. 
- $y \mid \theta \sim \mathrm{MVN}(\theta, E)$, where $E$ is a diagonal matrix 
with the $i$-th diagonal element $\sigma_i^2 = \tilde\sigma_i^2$; 

- $\theta \mid \beta, \lambda, \sigma_v^2 \sim \mathrm{MVN}(X\beta, \sigma_v^2 D^{-1})$, where $D = \lambda R + (1 - \lambda) I$, 
with $I$ an identity matrix of dimension $m$, 
and $R$ the neighbourhood matrix; 

- Priors for the parameters $(\beta, \lambda, \sigma_v^2)$: $\pi(\beta) \propto 1$; $\pi(\lambda) \sim$ 
Uniform(0, 1), where $0 < \lambda < 1$; $\pi(\sigma_v^2) \sim \mathrm{IG}(a_0, b_0)$, 
where $a_0, b_0$ are chosen to be very small known 
constants. MVN stands for the multivariate normal 
distribution. 


Note that the proposed model CAR-FHM reduces to FHM 
when the spatial autocorrelation parameter $\lambda = 0$. 

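We also consider two HB models with the sampling 
variance $\sigma_i^2$ unknown and modeled by the direct unbiased 
estimator $s_i^2$. 


Model 3: You-Chapman Model, denoted as YCM (You and 
Chapman 2006). 
- $y_i \mid \theta_i, \sigma_i^2 \overset{ind}{\sim} N(\theta_i, \sigma_i^2)$, $i = 1, \ldots, m$; 

- $d_i s_i^2 \mid \sigma_i^2 \overset{ind}{\sim} \sigma_i^2 \chi^2_{d_i}$, where $d_i = n_i - 1$, for $i = 1, \ldots, m$; 

- $\theta_i \mid \beta, \sigma_v^2 \overset{ind}{\sim} N(x_i'\beta, \sigma_v^2)$, for $i = 1, \ldots, m$; 

- Priors for unknown parameters $(\beta, \sigma_v^2, \sigma_i^2, i = 1, \ldots, m)$: 
$\pi(\beta) \propto 1$; $\pi(\sigma_v^2) \sim \mathrm{IG}(a_0, b_0)$; $\pi(\sigma_i^2) \sim \mathrm{IG}(a_i, b_i)$ 
for $i = 1, \ldots, m$, where $a_i, b_i$ $(0 \le i \le m)$ 
are chosen to be very small known constants to reflect 
vague knowledge on $\sigma_i^2$ and $\sigma_v^2$. 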


Model 4: Proposed area level CAR model with unknown 
sampling variances, as an extension of You-Chapman 
model, denoted as CAR-YCM. 
- $y \mid \theta, \sigma_1^2, \ldots, \sigma_m^2 \sim \mathrm{MVN}(\theta, E)$, where matrix $E$ has 
diagonal elements $\sigma_i^2$; 

- $d_i s_i^2 \mid \sigma_i^2 \overset{ind}{\sim} \sigma_i^2 \chi^2_{d_i}$, where $d_i = n_i - 1$, for $i = 1, \ldots, m$; 

- $\theta \mid \beta, \lambda, \sigma_v^2 \sim \mathrm{MVN}(X\beta, \sigma_v^2 D^{-1})$, where $D = \lambda R + (1 - \lambda) I$; 

- Priors for the parameters $(\beta, \lambda, \sigma_v^2, \sigma_i^2, i = 1, \ldots, m)$: 
$\pi(\beta) \propto 1$; $\pi(\lambda) \sim$ Uniform(0, 1), where $0 < \lambda < 1$; 
$\pi(\sigma_v^2) \sim \mathrm{IG}(a_0, b_0)$; $\pi(\sigma_i^2) \sim \mathrm{IG}(a_i, b_i)$ for 
$i = 1, \ldots, m$, where $a_i, b_i$ $(0 \le i \le m)$ are chosen to be 
very small known constants. 


Again, note that the proposed model CAR-YCM reduces 
to the You-Chapman model when $\lambda = 0$. For both models 
3 and 4 there is an implicit assumption that the area-specific 
sample size $n_i > 2$. If flat priors are used for $\sigma_i^2$, we 
should have $n_i \ge 4$ to ensure proper posteriors (You and 
Chapman 2006). 

We apply the Gibbs sampling method to estimate the 
posterior mean $E(\theta_i \mid y)$ and the corresponding posterior 
variance $\mathrm{Var}(\theta_i \mid y)$. The required full conditional 
distributions of parameters under different models are given in 
Appendix A. For the Fay-Herriot model and the 
You-Chapman model, all the full conditional distributions have 
closed forms and drawing samples from these distributions 
is straightforward. For the proposed two area level spatial 
models CAR-FHM and CAR-YCM, the conditional 
distribution of the spatial correlation parameter $\lambda$ does not have 
a closed form. We use the Metropolis-Hastings algorithm 
within the Gibbs sampler (Chib and Greenberg 1995) to 
update $\lambda$. Under the model CAR-FHM, the full conditional 
distribution of $\lambda$ in the Gibbs sampler can be written as 

$$[\lambda \mid \theta, \beta, \sigma_v^2] \propto h(\lambda) f(\lambda),$$

where $f(\lambda)$ is a density function of the uniform 
distribution, Uniform(0, 1), given as 

$$f(\lambda) \propto 1, \quad \text{where } 0 < \lambda < 1,$$

and $h(\lambda)$ is a function given by 

$$h(\lambda) \propto |\lambda R + (1 - \lambda) I|^{1/2} \exp\left\{ -\frac{1}{2\sigma_v^2} (\theta - X\beta)' [\lambda R + (1 - \lambda) I] (\theta - X\beta) \right\}.$$

We use $f(\lambda)$ as the “candidate” generating density 
function in the Metropolis-Hastings updating step. To update $\lambda$ 
from the current values $\lambda^{(k)}$ and $(\theta^{(k)}, \beta^{(k)}, \sigma_v^{2(k)})$, we proceed as 
follows: 

1. Draw $\lambda^*$ from a uniform distribution; 

2. Compute the acceptance probability $\alpha(\lambda^*, \lambda^{(k)}) = \min\{h(\lambda^*)/h(\lambda^{(k)}), 1\}$; 

3. Generate $u$ from a uniform distribution; if $u < \alpha(\lambda^*, \lambda^{(k)})$, 
then the candidate value $\lambda^*$ is accepted, 
i.e., $\lambda^{(k+1)} = \lambda^*$; otherwise $\lambda^*$ is rejected, and we set 
$\lambda^{(k+1)} = \lambda^{(k)}$. 


For the model CAR-YCM, a similar procedure can be 
applied when drawing samples from the conditional 
distribution of $\lambda$. 
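
The three updating steps above can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the function names, the four-area chain map and the residual vector are hypothetical, the residual $\theta - X\beta$ and the variance are held fixed rather than redrawn by a surrounding Gibbs sampler, and the ratio of h values is computed on the log scale for numerical stability.

```python
import math
import numpy as np

def log_h(lam, resid, R, sigma2_v):
    """log h(lambda), up to an additive constant, for the CAR-FHM model."""
    D = lam * R + (1.0 - lam) * np.eye(R.shape[0])
    _, logdet = np.linalg.slogdet(D)          # D is positive definite for 0 <= lam < 1
    quad = resid @ D @ resid                  # (theta - X beta)' D (theta - X beta)
    return 0.5 * logdet - quad / (2.0 * sigma2_v)

def update_lambda(lam, resid, R, sigma2_v, rng):
    """One Metropolis-Hastings update with a Uniform(0, 1) candidate."""
    cand = rng.uniform(0.0, 1.0)                                   # step 1
    log_ratio = log_h(cand, resid, R, sigma2_v) - log_h(lam, resid, R, sigma2_v)
    alpha = 1.0 if log_ratio >= 0 else math.exp(log_ratio)         # step 2
    if rng.uniform() < alpha:                                      # step 3
        return cand, alpha
    return lam, alpha

# Hypothetical four-area chain map and fixed residuals theta - X beta.
W = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
R = np.diag(W.sum(axis=1)) - W                # neighbourhood matrix
resid = np.array([0.3, -0.2, 0.1, 0.4])

rng = np.random.default_rng(1)
lam, draws = 0.5, []
for _ in range(500):
    lam, _ = update_lambda(lam, resid, R, sigma2_v=1.5, rng=rng)
    draws.append(lam)
print(np.mean(draws))
```

Because the candidate density is uniform and does not depend on the current value, it cancels from the Metropolis-Hastings ratio, leaving only the ratio of h values in the acceptance probability.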


3. Data analysis 


3.1 Data description and implementation 


The Canadian Community Health Survey (CCHS) is a 
federal survey conducted by Statistics Canada. The primary 
objective of CCHS is to provide timely and reliable 
estimates of health determinants, health status and health 
system utilization across Canada. It is a cross-sectional 
survey which operates on a two-year collection cycle. The 
first year of the survey cycle “x.1” targets individuals aged 
12 or older who are living in private dwellings, and it is a 
general population health survey with a large sample 
(130,000 persons) designed to provide reliable estimates at 
the health region, provincial and national levels. The second 
year of the survey cycle “x.2” has a smaller sample (30,000 
persons) allocated based on provincial sample buy-ins and is 
designed to provide provincial and national level results on 
specific focused health topics. Although national and 
provincial estimates are very important, there is an in- 
creasing demand for health data at lower levels of geog- 
raphy voiced by a number of provinces including British 
Columbia (BC), Prince Edward Island (PEI), Quebec and 
others. Cycle “x.1” of the CCHS collected data corresponding 
to 136 health regions in the 10 provinces and three territories. 
It primarily used two sampling frames. The first one, 
used as the primary frame, was based on the area frame 
designed for the Canadian Labour Force Survey, and within 
the area frame, a multistage stratified cluster design was 
used to sample dwellings. The second frame consists of a 
list of telephone numbers. Random digit dialing metho- 
dology is used in some of the health regions for cost 
reasons. More details of the design are provided in Béland 
(2002). In this paper, we use a small data set from Cycle 1.1 
as an example to demonstrate the analysis. We are interested 
in estimating the disease rate for local health regions within 
provinces. In particular, we apply the four models discussed 
in section 2 to estimate the asthma rate for 20 health regions 
in the province of BC using the data from Cycle 1.1. Figure 
1 shows the map of the 20 health regions in the province of 
British Columbia. We use this map to define the neigh- 
bourhood correlation matrix used in the spatial models. 
Appendix B gives the list of health regions and related 
spatial structures. 



Figure 1 Map of 20 health regions in the province of British 
Columbia 


Let $\theta_i$ denote the true asthma rate for the $i$-th health 
region in BC, $i = 1, \ldots, 20$. From the survey data of Cycle 
1.1, we obtained the direct survey estimate $y_i$ of $\theta_i$ as the 
ratio of the number of people having asthma (direct survey 
estimate) divided by the corresponding population size 
(known constant). We have also included six area level 
auxiliary variables used in the model, and these six variables 
are total population size, number of persons who have 
asthma as one of the symptoms of the chronic disease, 
number of persons who have asthma as the main symptom 
of the chronic disease, number of persons who have diabetes 
as one of the symptoms of the chronic disease, number of 
persons who have diabetes as the main symptom of the 
chronic disease, and number of visits to hospitals. Note that 
in the literature related to disease mapping (e.g., Mollié 
1996; Maiti 1998; MacNab 2003), a Poisson or Binomial 
distribution is usually assumed in the sampling model for
the direct estimate $y_i$. However, in small area estimation,
the direct estimate $y_i$ is obtained based on the complex
sampling design used in the survey. Thus, it is customary
to assume a normal sampling model for the direct
estimates $y_i$; see, for example, Datta, Lahiri, Maiti and Lu
(1999), Rao (2003), Mohadjer, Rao, Liu, Krenzke and
Van de Kerckhove (2007), and You (2008a). Note that we
have only considered one kind of disease rate data from one
province in our study, and we use this example to illustrate
the proposed models and to evaluate the effects of spatial
modeling in small area models.

To implement the Gibbs sampling, we use L = 5 parallel
runs, each with a “burn-in” length of B = 2,000 and a Gibbs
sampling size of G = 5,000. For the proposed models CAR-
FHM and CAR-YCM, in order to reduce the autocorrelation
that results from the accept-rejection algorithm in the run,
we take every 5th iteration after the “burn-in” period.
Therefore, for models FHM and YCM, we have n = 5,000
samples for each run, and for models CAR-FHM and CAR-
YCM, we have n = 1,000 samples for each run. Conver-
gence of the Gibbs sampling is monitored for the small area
parameters $\theta_i$ and other unknown parameters in the model
using the potential scale reduction factor (Gelman and
Rubin 1992; Gelman, Carlin, Stern and Rubin 2004, pages
296-297). We have computed the reduction factors for all
the monitored parameters in the model in the Gibbs
sampling. These factor values are all very close to 1 (less
than 1.05), which suggests that the desired convergence for
these parameters is achieved by the Gibbs sampler.
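
The convergence check described above can be sketched in a few lines. The block below is only an illustration of the basic form of the potential scale reduction factor for a single scalar parameter; the function name, the simulated runs and the seed are assumptions made for the example and are not taken from the analysis in this paper.

```python
import random
from statistics import fmean, variance


def potential_scale_reduction(chains):
    """Potential scale reduction factor for one scalar parameter.

    chains: list of equal-length lists, one per parallel run,
    holding the draws kept after the burn-in period.
    """
    n = len(chains[0])
    # W: average of the within-chain sample variances.
    within = fmean([variance(c) for c in chains])
    # B / n: sample variance of the chain means.
    between_over_n = variance([fmean(c) for c in chains])
    # Pooled estimate of the marginal posterior variance.
    var_plus = (n - 1) / n * within + between_over_n
    return (var_plus / within) ** 0.5


# Hypothetical example: 5 runs of 1,000 retained draws from one distribution.
rng = random.Random(2011)
runs = [[rng.gauss(0.08, 0.01) for _ in range(1000)] for _ in range(5)]
print(round(potential_scale_reduction(runs), 3))
```

Values close to 1 indicate that the parallel runs agree; runs centred at clearly different values give a factor well above the 1.05 threshold used above.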

We have used vague priors for the hyperparameters in
the model, as is common practice in HB small area esti-
mation. In particular, the flat prior $\pi(\beta) \propto 1$ for the
regression parameter and proper inverse gamma priors for variance
components are commonly used (e.g., Arora and Lahiri
1997; Ghosh et al. 1998; Datta et al. 1999; You and Rao
2000; Rao 2003, page 237; Souza et al. 2009). Following
MacNab (2003), we have used the uniform prior $\pi(\lambda) \sim$
Uniform(0, 1) for the autocorrelation parameter. Uniform
priors are also commonly used for the autocorrelation
parameters in spatial models (e.g., Maiti 1998; He and Sun
2000; Rao 2003, page 266). We also tried several different
values for the inverse gamma priors. The HB estimates are 
quite stable and not sensitive to the choice of vague proper 
priors. More detailed discussion on sensitivity analysis can 
be found, for example, in You and Chapman (2006) for 
similar models. 


3.2 Comparison of results 


First, we present the HB estimates of the asthma rate
under models FHM and CAR-FHM, in which the sampling
variances $\sigma_i^2$ are assumed to be known. We used the
smoothed estimate $\tilde{\sigma}_i^2$ obtained by the smoothing technique
in You (2008a) as described in Section 2. Figure 2 displays
the direct estimates and the HB model-based estimates
under FHM and CAR-FHM for the 20 health regions in BC.
The health regions appear on the x-axis ranked in
order of sample size, with the smallest (Peace Liard) on the
left and the largest (South Fraser Valley) on the right. Model
1 (FHM) and Model 2 (CAR-FHM) give similar point esti-
mates, and both sets of model-based estimates are moderately
smoothed compared to the direct estimates. More-
over, the direct estimates and the two HB estimates of the dis-
ease rate are very close for some health regions with large
sample sizes, but for some areas with smaller sample sizes,
they differ to some extent. Similar results are obtained under
Model 3 (YCM) and Model 4 (CAR-YCM).


[Plot: Comparison of Asthma Rate Estimates. Series: Direct, FHM, CAR-FHM. x-axis: BC Health Regions ordered by sample size (small to large).]


Figure 2 Direct and HB model-based estimates under models 
FHM and CAR-FHM 


Figure 3 presents the CVs of the direct and two HB
model-based estimates with the health regions ordered by
sample size from the smallest to the largest as in Figure
2. The CVs of the HB estimates are obtained by dividing the
square root of the posterior variance by the posterior mean.
As expected, the CVs of the direct estimates show a clear
tendency to decrease as the sample size increases. However,
the two model-based estimates give smoother CVs. More-
over, the two HB model-based estimates exhibit a great
improvement over the direct design-based estimates in
terms of precision and reliability, that is, smaller CVs.
Compared to the direct estimates, the average CV reduction 
of the HB estimates under FHM is about 22.7% ranging 
from 7.8% to 40.5%, and the average reduction of the CVs 
for the HB estimates under the proposed CAR-FHM is 
27.8% ranging from 12.5% to 52.1%. Thus it is clear that 
the proposed spatial model CAR-FHM is superior to the 
Fay-Herriot model. We also obtained similar results for the 
models YCM and CAR-YCM when the sampling variance 
is modeled directly. The average CV reduction under YCM 
is 23.9%, whereas the average CV reduction is 29.0% under 
the proposed spatial model CAR-YCM. Details of the 
results including the point estimates and the corresponding 
CVs are presented in a table in Appendix C. In our example, 
the sample size at the health region level is relatively large. 
The model-based estimates have still shown great improve- 
ment over the direct survey estimates. Our results indicate 
that the presented small area models can be used to improve 
the direct survey estimates even when the sample size is 
relatively large. Note that Bayesian credible intervals for the 
small area parameters can be easily constructed using the 
MCMC output from the Gibbs sampler if required by prac- 
tical users. This is an advantage of using the HB inference 
via MCMC sampling. However, in this paper we only report
the model-based point estimates and the corresponding CVs
as our main purpose is to compare the model-based esti-
mates with the direct estimates and to show the efficiency
gain of the models. The gain in efficiency is clearly evident.
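
The two quantities used in this comparison can be illustrated as follows. This is a minimal sketch: the function names are ours, and the three rows of CVs are copied from the table in Appendix C, so the reductions printed are for those areas only and not the averages quoted above.

```python
from statistics import fmean, stdev


def posterior_cv(draws):
    """CV of an HB estimate: posterior standard deviation over posterior mean."""
    return stdev(draws) / fmean(draws)


def cv_reduction(cv_direct, cv_model):
    """Relative CV reduction of a model-based estimate over the direct one."""
    return (cv_direct - cv_model) / cv_direct


# CVs for three health regions, copied from the table in Appendix C
# (area ID: (direct, FHM, CAR-FHM)).
cvs = {
    5: (0.158, 0.094, 0.076),
    14: (0.206, 0.126, 0.125),
    18: (0.155, 0.143, 0.136),
}

for area, (direct, fhm, car_fhm) in cvs.items():
    print(area,
          round(100 * cv_reduction(direct, fhm), 1),
          round(100 * cv_reduction(direct, car_fhm), 1))
```

In practice `posterior_cv` would be applied to the retained MCMC draws of each $\theta_i$; here it is shown only to make the definition of the CV explicit.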


[Plot: Comparison of CVs. Series: Direct, FHM, CAR-FHM. x-axis: BC Health Regions ordered by sample size (small to large).]


Figure 3 Direct and HB CVs under models FHM and CAR- 
FHM 


In order to investigate the effects of incorporating the 
spatial structure in the model, we present the CVs of the 
direct and HB estimates by health regions sorted according 
to the number of neighbouring regions from the smallest (2 
neighbours) to the largest (7 neighbours) in Figure 4. It 
shows that the HB estimates from the proposed model
CAR-FHM have smaller CVs than the estimates from the
Fay-Herriot model. In addition, the improvement of CAR-
FHM over the Fay-Herriot model is much more obvious in
the regions with more neighbours, and these two models
give very close CVs in the regions with fewer adjacent areas.
Very similar results are also obtained for CAR-YCM over
YCM. Table 1 gives the average reduction of the CVs
across the health regions with the same number of neigh-
bours. The results in Table 1 present the CV reduction of the
proposed spatial models for both cases of known and
unknown sampling variances. For example, for known $\sigma_i^2$
(smoothed $\tilde{\sigma}_i^2$), for areas with only 2 neighbours, the
average CV reduction of model CAR-FHM over the Fay-
Herriot model is only around 0.9%, whereas for areas with 7
neighbours, the average CV reduction for CAR-FHM over
FHM is as high as about 19%. For the case of unknown
$\sigma_i^2$, similar results are obtained for CAR-YCM over YCM.
The numerical results in Table 1 confirm the clear trend of 
increased CV reduction under the proposed spatial model 
over FHM or YCM as the number of neighbours increases. 
Thus, more neighbouring areas can provide more informa- 
tion in the spatial structure to improve the precision and 
reliability of the HB estimates. 


[Plot: Comparison of CVs. x-axis: BC Health Regions ordered by number of neighbours (small to large).]


Figure 4 Direct and HB CVs under models FHM and CAR- 
FHM with the health regions sorted by the number of 
neighbours 


Table 1
Comparison of average CV reduction

Number of       Average CV reduction
neighbours      CAR-FHM over FHM      CAR-YCM over YCM
2               0.9%                  1.8%
3               3.7%                  3.5%
4               6.3%                  6.0%
5               8.9%                  8.7%
6               13.7%                 11.0%
7               19.2%                 20.7%


3.3 Bayesian model comparison


In this section, we compare the proposed models CAR-
FHM with FHM and CAR-YCM with YCM, respectively.
For hierarchical Bayes model comparison, the deviance
information criterion (DIC) proposed by Spiegelhalter, Best,
Carlin and van der Linde (2002) has been commonly used in recent
years to compare non-nested and mixed effects Bayesian
models. The DIC is based on the deviance of the model
$D(\theta)$, which is equal to minus twice the log-likelihood of
the model, and the DIC is usually computed as $\mathrm{DIC} =
D(\bar{\theta}) + 2p_D$, where $D(\bar{\theta})$ is the deviance of the model
evaluated at the posterior mean of the model parameters,
which summarizes the goodness of fit of the model, and $p_D$
is the effective number of parameters, which captures the
complexity of the model. $p_D$ is defined as $p_D = \bar{D}(\theta) -
D(\bar{\theta})$, where $\bar{D}(\theta)$ is the posterior mean of the deviance of
the model. Thus the DIC is the sum of a term for the
goodness of fit of the model and a term for the model complexity.
Smaller values of DIC indicate a better model fit.
Computation of DIC is relatively straightforward provided
that the deviance $D(\theta)$ is available in closed form, and $p_D$
may be calculated after the Gibbs sampling run by taking
the sample mean of the simulated values of $D(\theta)$ minus
the plug-in estimate of the deviance $D(\bar{\theta})$. For the four
models presented in section 2, we computed the correspond-
ing DIC values, as shown in Table 2. It is clear that the
proposed spatial models CAR-FHM and CAR-YCM both
have smaller DIC values than the non-spatial models FHM
and YCM respectively, which indicates that the spatial
models are better than the non-spatial models in our study.
Both spatial models CAR-FHM and CAR-YCM perform
well in this example. This result of model comparison is
consistent with the estimation results presented in
section 3.2.


Table 2
Comparison of DIC values for the four hierarchical models

Model       DIC value
FHM         27.1
CAR-FHM     24.6
YCM         26.8
CAR-YCM     24.5
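
To make the DIC computation concrete, the following sketch evaluates $D(\bar{\theta})$, $p_D$ and the DIC from stored Gibbs draws. It assumes a normal sampling model $y_i \sim N(\theta_i, \sigma_i^2)$ with known sampling variances, as in FHM; the function names and the toy inputs are illustrative assumptions, and the values in Table 2 were not produced by this code.

```python
import math
from statistics import fmean


def deviance(y, theta, sigma2):
    """Minus twice the log-likelihood of y_i ~ N(theta_i, sigma2_i)."""
    return sum(
        math.log(2 * math.pi * s2) + (yi - ti) ** 2 / s2
        for yi, ti, s2 in zip(y, theta, sigma2)
    )


def dic(y, theta_draws, sigma2):
    """Return (DIC, p_D) from posterior draws of theta (one list per draw)."""
    # Posterior mean of the deviance over the retained draws.
    mean_dev = fmean([deviance(y, th, sigma2) for th in theta_draws])
    # Deviance evaluated at the posterior mean of theta (plug-in estimate).
    theta_bar = [fmean(col) for col in zip(*theta_draws)]
    dev_at_mean = deviance(y, theta_bar, sigma2)
    p_d = mean_dev - dev_at_mean
    return dev_at_mean + 2 * p_d, p_d


# Toy check with two areas and two draws (hypothetical numbers).
value, p_d = dic([0.0, 0.0], [[1.0, 0.0], [-1.0, 0.0]], [1.0, 1.0])
print(round(value, 3), round(p_d, 3))
```

For the models with unknown sampling variances, the deviance would also depend on the draws of $\sigma_i^2$, which this sketch does not cover.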


3.4 Test of model fit 


In order to check the overall model fit of the proposed
models CAR-FHM and CAR-YCM, we use the method of
posterior predictive distribution. Let $y_{rep}$ denote the repli-
cated observation under the model. The posterior predictive
distribution of $y_{rep}$ given the observed data $y_{obs}$ is de-
fined as $f(y_{rep} \mid y_{obs}) = \int f(y_{rep} \mid \theta) f(\theta \mid y_{obs}) \, d\theta$. In this
approach, a test statistic $T(y, \theta)$ that depends on the data $y$
and possibly the parameter $\theta$ can be defined, and the
observed value $T(y_{obs}, \theta \mid y_{obs})$ is compared to the posterior
predictive distribution of $T(y_{rep}, \theta \mid y_{obs})$, with any signifi-
cant difference indicating a model failure. Lack of fit of the
data with respect to the posterior predictive distribution can
be measured by the p-value of the test quantity (Meng 1994;
Gelman, Meng and Stern 1996). The posterior predictive p-
value is defined as $p = P(T(y_{rep}, \theta) \ge T(y_{obs}, \theta) \mid y_{obs})$. If
the given model adequately fits the observed data, then
$T(y_{obs}, \theta \mid y_{obs})$ should be near the central part of the
histogram of the $T(y_{rep}, \theta \mid y_{obs})$ values if $y_{rep}$ is generated
repeatedly from the posterior predictive distribution. Conse-
quently, the posterior predictive p-value is expected to be
near 0.5 if the model adequately fits the data. Extreme p-
values (near 0 or 1) suggest poor fit. The posterior predictive
p-value model checking has been criticized for being
conservative due to the double use of the observed data; see,
for example, Bayarri and Berger (2000). They proposed
alternative model checking p-value measures, named the
partial posterior predictive p-value and the conditional
predictive p-value. However, their methods are more
difficult to implement and interpret (Rao 2003; Sinharay
and Stern 2003). As noted in Sinharay and Stern (2003), the
posterior predictive p-value is especially useful if we think
of the current model as a plausible ending point with
modifications to be made only if substantial lack of fit is
found.

To carry out the posterior predictive model checking, we
need to specify a test quantity $T(y, \theta)$. You (2008b) studied
several test quantities in posterior predictive model checking
for small area models through a simulation study and pro-
posed a test quantity given as


$T(y, \theta) = |\max(y_i) - \mathrm{mean}(\theta_i)| - |\min(y_i) - \mathrm{mean}(\theta_i)|$.


It is shown in You (2008b) that the proposed test quantity
$T(y, \theta)$ is sensitive to the choice of distribution of random
effects and different mean functions under the Fay-Herriot
model. A similar test quantity is also suggested in Gelman
et al. (2004) for posterior predictive model checking. In our
study, under the proposed model CAR-FHM, the estimated
p-value is 0.472, and under model CAR-YCM, the esti-
mated p-value is 0.453. Thus there is no indication of lack
of model fit and both proposed spatial models fit the data
quite well.
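
As an illustration of how such a p-value might be estimated from the Gibbs output, the sketch below draws one replicate data set per posterior draw and compares the test quantity on replicated and observed data. It assumes a normal sampling model with known standard deviations; the function names, the inputs and the seed are hypothetical and serve only the example.

```python
import random
from statistics import fmean


def discrepancy(y, theta):
    """T(y, theta) = |max(y_i) - mean(theta_i)| - |min(y_i) - mean(theta_i)|."""
    centre = fmean(theta)
    return abs(max(y) - centre) - abs(min(y) - centre)


def posterior_predictive_pvalue(y_obs, theta_draws, sigma, rng):
    """Share of draws with T(y_rep, theta) >= T(y_obs, theta)."""
    exceed = 0
    for theta in theta_draws:
        # One replicate data set from the normal sampling model.
        y_rep = [rng.gauss(t, s) for t, s in zip(theta, sigma)]
        if discrepancy(y_rep, theta) >= discrepancy(y_obs, theta):
            exceed += 1
    return exceed / len(theta_draws)


# Hypothetical inputs: 3 areas and 200 identical posterior draws.
draws = [[0.07, 0.08, 0.09]] * 200
p = posterior_predictive_pvalue(
    [0.071, 0.078, 0.092], draws, [0.01] * 3, random.Random(0)
)
print(p)
```

An observed data set with one extreme value on the upper side pushes the estimate towards 0, and one on the lower side pushes it towards 1, which is the behaviour the check relies on.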

To assess model fit at the individual observation level,
we also computed the individual predictive probability
values $p_i$ as $p_i = P(y_{i(rep)} < y_{i(obs)} \mid y_{obs})$; see, for exam-
ple, Gelfand (1996) and Daniels and Gatsonis (1999). These
individual predictive probabilities provide information on
the degree of consistent overestimation or underestimation
of the observed data. For model CAR-FHM, the $p_i$ ranges
from 0.325 to 0.768 with a mean of 0.517 and a median of
0.496; for model CAR-YCM, the $p_i$ ranges from 0.316 to
0.772 with a mean of 0.511 and a median of 0.497. Both
models give very similar results and the mean and median 
values are all around 0.5. There is no indication of any 
consistent overestimation or underestimation of the pro- 
posed models. The overall p-values and individual pre- 
dictive probabilities have shown that the proposed spatial 
small area models fit the data quite well. 
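
The individual predictive probabilities can be estimated in the same way, area by area. The sketch below again assumes a normal sampling model with known standard deviations; the function name and inputs are illustrative assumptions only.

```python
import random


def individual_predictive_probs(y_obs, theta_draws, sigma, rng):
    """Estimate p_i = P(y_i(rep) < y_i(obs) | y_obs) for each area i."""
    m = len(y_obs)
    below = [0] * m
    for theta in theta_draws:
        for i in range(m):
            # Replicate for area i drawn from N(theta_i, sigma_i^2).
            if rng.gauss(theta[i], sigma[i]) < y_obs[i]:
                below[i] += 1
    return [b / len(theta_draws) for b in below]


# Hypothetical inputs: 2 areas and 500 identical posterior draws.
probs = individual_predictive_probs(
    [0.081, 0.069], [[0.08, 0.07]] * 500, [0.01, 0.01], random.Random(1)
)
print([round(p, 2) for p in probs])
```

Values near 0 or 1 for an area would point to consistent over- or underestimation for that area.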


3.5 Bias diagnostics 


To evaluate any possible bias of the model-based esti- 
mates under the proposed models with respect to the direct 
survey estimates, following Brown, Chambers, Heady and 
Heasman (2001), we consider a simple method of regression 
analysis for the direct estimates and the HB model-based 
estimates. You (2008a) also used the regression analysis 
method for model bias diagnostics. If the model-based 
estimates are close to the true values of the small area 
disease rate, then the direct survey estimates, which are 
assumed to be unbiased for the true disease rates, should 
behave like random variables whose expected values 
correspond to the values of the model-based estimates. That 
means the model-based estimates should be unbiased 
predictors of the direct estimates. In terms of regression
analysis, we fit the regression model $Y = \alpha + \beta X$
to the data, estimate the coefficients, and see
how close the regression line is to $Y = X$. Let $Y$ be the
direct survey estimates and $X$ be the model-based esti-
mates. Under the proposed CAR-FHM, we obtain the re-
gression line $Y = -0.0021\,(0.011) + 1.0365\,(0.1445)\,X$;
under the proposed CAR-YCM, we obtain the regression
line $Y = -0.0028\,(0.0108) + 1.0458\,(0.1427)\,X$. Thus
both regression lines show very little disparity from
$Y = X$. We therefore conclude that the model-based
estimates are consistent with the direct estimates, with no
extra bias induced by the proposed models. The
results also give no evidence of bias
due to possible model misspecification.
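
The regression diagnostic amounts to an ordinary least squares fit of the direct estimates on the model-based estimates. A minimal sketch follows; the function name is ours and, to keep the example short, only the first five areas of Appendix C are used, so the coefficients printed will differ from those reported above for all 20 regions.

```python
from statistics import fmean


def ols_line(x, y):
    """Least-squares intercept and slope of y on x."""
    x_bar, y_bar = fmean(x), fmean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


# First five areas of Appendix C: CAR-FHM estimates (X) and direct estimates (Y).
car_fhm = [0.0812, 0.0793, 0.0731, 0.0874, 0.0736]
direct = [0.0765, 0.0804, 0.0745, 0.0893, 0.0782]
intercept, slope = ols_line(car_fhm, direct)
print(round(intercept, 4), round(slope, 4))
```

An intercept near 0 and a slope near 1 correspond to the line $Y = X$.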


4. Conclusions 


In this paper we have discussed two area level models, 
namely, the well-known Fay-Herriot model in which the 
sampling variance is assumed to be known, and the You- 
Chapman model in which the sampling variance is unknown 
and modeled separately by its direct estimator. In both the 
Fay-Herriot model and You-Chapman model, the area 
random effects are assumed to be iid normal random 
variables to capture unexplained area heterogeneity effects. 
After comparing various forms of Gaussian CAR models 
proposed in the literature (e.g., Best et al. 2005) for disease 
mapping to incorporate spatially correlated effects, we 
extended the independent area effects model to a spatial 
correlation model and combined it with the traditional small
area models. The proposed new small area spatial corre-
lation models CAR-FHM and CAR-YCM include the small
area sampling models and a spatial correlation linking
model which captures both the unstructured heterogeneity
among areas and the spatial correlation effects of the
neighbouring areas. We do not need to specify the spatial
autocorrelation parameter in the model; this parameter
is estimated from the data.

In the data analysis we compared the proposed spatial 
models with the non-spatial effects models by applying the 
models to estimate the rates of asthma for 20 health regions 
in the province of British Columbia. Our results have shown 
that the model-based estimates achieve a great improvement 
over the direct estimates in terms of moderately smoothed 
point estimates and much smaller CVs. Particularly, the 
proposed models are superior to the Fay-Herriot model or 
You-Chapman model whether the sampling variances are 
assumed to be known or unknown. Moreover, note that the 
CV reduction of the proposed spatial models over the Fay- 
Herriot model or You-Chapman model is greater for the 
areas with more neighbours. Results of the Bayesian model 
comparison and model fit analysis are also in favor of the 
proposed small area spatial models. 

In future work, the proposed small area spatial models 
can be extended to unmatched sampling and linking models 
(You and Rao 2002) with the sampling variance known or 
unknown. We plan to evaluate the estimation effects of 
different spatial models as well as the effects of spatial 
structures. For data analysis, we will produce model-based 
health status estimates based on the proposed models for 
health regions across Canada and evaluate the possibility of 
extending the model-based approach to lower level esti-
mates such as age-sex domains within health regions. We
also plan to consider the data cloning method (Lele, Dennis
and Lutscher 2007; Lele, Nadeem and Schmuland 2010) for
the spatial models. An advantage of the data cloning method is
that the results do not depend on the choice of priors,
but the computational burden could be considerable.


Appendix A


Full conditional distributions 


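A.1. Gibbs sampling full conditional distributions under Model 1: FHM.


• $[\theta_i \mid y_i, \beta, \sigma_v^2] \sim N(\gamma_i y_i + (1 - \gamma_i) x_i'\beta, \ \sigma_i^2 \gamma_i)$, where $\gamma_i = \sigma_v^2 / (\sigma_v^2 + \sigma_i^2)$, for $i = 1, \ldots, m$;


• $[\beta \mid \theta, \sigma_v^2] \sim N_p\left( \left( \sum_{i=1}^m x_i x_i' \right)^{-1} \left( \sum_{i=1}^m x_i \theta_i \right), \ \sigma_v^2 \left( \sum_{i=1}^m x_i x_i' \right)^{-1} \right)$;


• $[\sigma_v^2 \mid \theta, \beta] \sim IG\left( a_0 + \frac{1}{2} m, \ b_0 + \frac{1}{2} \sum_{i=1}^m (\theta_i - x_i'\beta)^2 \right)$.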
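
The three full conditionals in A.1 translate directly into a Gibbs sampler. The sketch below is a simplified illustration only: it assumes a single covariate (p = 1) so that no matrix inversion is needed, and the function name, starting values, hyperparameter values and data are hypothetical choices made for the example.

```python
import random
from statistics import fmean


def gibbs_fhm(y, x, sigma2, n_iter, burn_in, a0=0.001, b0=0.001, seed=0):
    """Gibbs sampler for the Fay-Herriot model with one covariate (p = 1).

    y: direct estimates; x: scalar covariate per area; sigma2: known
    sampling variances. a0, b0 are illustrative inverse gamma hyperparameters.
    Returns the retained draws of theta (one list per iteration).
    """
    rng = random.Random(seed)
    m = len(y)
    sxx = sum(xi * xi for xi in x)
    beta, sigma2_v = 0.0, 1.0          # arbitrary starting values
    theta = list(y)
    kept = []
    for it in range(burn_in + n_iter):
        # theta_i | y, beta, sigma2_v: shrink y_i towards x_i * beta.
        for i in range(m):
            gamma = sigma2_v / (sigma2_v + sigma2[i])
            mean = gamma * y[i] + (1 - gamma) * x[i] * beta
            theta[i] = rng.gauss(mean, (sigma2[i] * gamma) ** 0.5)
        # beta | theta, sigma2_v under the flat prior.
        beta_hat = sum(xi * ti for xi, ti in zip(x, theta)) / sxx
        beta = rng.gauss(beta_hat, (sigma2_v / sxx) ** 0.5)
        # sigma2_v | theta, beta: inverse gamma, drawn as 1 / Gamma(shape, scale = 1 / rate).
        shape = a0 + m / 2
        rate = b0 + 0.5 * sum((ti - xi * beta) ** 2 for xi, ti in zip(x, theta))
        sigma2_v = 1.0 / rng.gammavariate(shape, 1.0 / rate)
        if it >= burn_in:
            kept.append(list(theta))
    return kept


# Hypothetical data: 4 areas, intercept-only model, known sampling variances.
draws = gibbs_fhm([0.08, 0.07, 0.09, 0.06], [1.0] * 4, [1e-4] * 4,
                  n_iter=500, burn_in=200)
print([round(fmean(col), 4) for col in zip(*draws)])
```

With several covariates, the draw of $\beta$ would use the multivariate normal in A.1, and the spatial models in A.2 and A.4 additionally require a step for $\lambda$ that is not sketched here.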


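A.2. Gibbs sampling full conditional distributions under Model 2: CAR-FHM.


• $[\theta \mid y, \beta, \lambda, \sigma_v^2] \sim MVN(\Lambda y + (I - \Lambda) X\beta, \ \Lambda E)$, where $\Lambda = (E^{-1} + D/\sigma_v^2)^{-1} E^{-1}$ with $E = \mathrm{diag}\{\sigma_1^2, \ldots, \sigma_m^2\}$
and $D = \lambda R + (1 - \lambda) I$;


• $[\beta \mid \theta, \lambda, \sigma_v^2] \sim MVN[(X'DX)^{-1} X'D\theta, \ \sigma_v^2 (X'DX)^{-1}]$;


• $[\lambda \mid \theta, \beta, \sigma_v^2] \propto |\lambda R + (1 - \lambda) I|^{1/2} \exp\left\{ -\frac{1}{2\sigma_v^2} (\theta - X\beta)' [\lambda R + (1 - \lambda) I] (\theta - X\beta) \right\}$;


• $[\sigma_v^2 \mid \theta, \beta, \lambda] \sim IG\left( a_0 + \frac{m}{2}, \ b_0 + \frac{1}{2} (\theta - X\beta)' D (\theta - X\beta) \right)$.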


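A.3. Gibbs sampling full conditional distributions under Model 3: YCM.


• $[\theta_i \mid y, \beta, \sigma_i^2, \sigma_v^2] \sim N(\gamma_i y_i + (1 - \gamma_i) x_i'\beta, \ \sigma_i^2 \gamma_i)$, where $\gamma_i = \sigma_v^2 / (\sigma_v^2 + \sigma_i^2)$, for $i = 1, \ldots, m$;


• $[\beta \mid \theta, \sigma_v^2] \sim N_p\left( \left( \sum_{i=1}^m x_i x_i' \right)^{-1} \left( \sum_{i=1}^m x_i \theta_i \right), \ \sigma_v^2 \left( \sum_{i=1}^m x_i x_i' \right)^{-1} \right)$.


A.4. Gibbs sampling full conditional distributions under Model 4: CAR-YCM.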


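• $[\theta \mid y, \beta, \lambda, \sigma_v^2, \sigma_i^2] \sim MVN(\Lambda y + (I - \Lambda) X\beta, \ \Lambda E)$, where $\Lambda = (E^{-1} + D/\sigma_v^2)^{-1} E^{-1}$, and
$E = \mathrm{diag}\{\sigma_1^2, \ldots, \sigma_m^2\}$, $D = \lambda R + (1 - \lambda) I$;


• $[\beta \mid \theta, \lambda, \sigma_v^2] \sim MVN[(X'DX)^{-1} X'D\theta, \ \sigma_v^2 (X'DX)^{-1}]$;


• $[\lambda \mid \theta, \beta, \sigma_v^2] \propto |\lambda R + (1 - \lambda) I|^{1/2} \exp\left\{ -\frac{1}{2\sigma_v^2} (\theta - X\beta)' [\lambda R + (1 - \lambda) I] (\theta - X\beta) \right\}$.


Appendix B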


List of 20 health regions in the province
of British Columbia with the corresponding sample sizes and spatial structures


ID number  Health region name           Sample size  Number of neighbours  Neighbours
1          East Kootenay                645          3                     2, 3, 15
2          West Kootenay-Boundary       705          3                     1, 3, 4
3          North Okanagan               890          5                     1, 2, 4, 5, 15
4          South Okanagan Similkameen   1,063        4                     2, 3, 5, 6
5          Thompson                     982          7                     3, 4, 6, 9, 11, 12, 15
6          Fraser Valley                1,125        5                     4, 5, 7, 8, 9
7          South Fraser Valley          1,437        4                     6, 8, 17, 19
8          Simon Fraser                 1,165        5                     6, 7, 9, 17, 18
9          Coast Garibaldi              623          5                     5, 6, 8, 11, 18
10         Central Vancouver Island     1,077        2                     11, 20
11         Upper Island/Central Coast   746          4                     5, 9, 10, 12
12         Cariboo                      673          4                     5, 11, 13, 15
13         North West                   650          3                     12, 14, 15
14         Peace Liard                  611          2                     13, 15
15         Northern Interior            859          6                     1, 3, 5, 12, 13, 14
16         Vancouver                    1,285        4                     17, 18, 19, 20
17         Burnaby                      871          5                     7, 8, 16, 18, 19
18         North Shore                  842          4                     8, 9, 16, 17
19         Richmond                     828          3                     7, 16, 17
20         Capital                      225          2                     10, 16


Note that Vancouver (#16) and Capital (#20) are not adjacent regions in the map since they are separated by the ocean. However, due to the
intensive and close connection between these two regions, we define them as neighbours in our study for illustration purposes only.
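
A neighbour list of this kind can be turned into the matrices used by the spatial models. The sketch below assumes the common convention that the neighbourhood matrix R has the number of neighbours of each area on the diagonal and -1 for each pair of adjacent areas; this convention, the function names and the four-region toy map are assumptions made for the example and should be checked against the model definition in section 2.

```python
def neighbourhood_matrix(neighbours):
    """R: number of neighbours on the diagonal, -1 for adjacent pairs, 0 otherwise."""
    ids = sorted(neighbours)
    index = {area: k for k, area in enumerate(ids)}
    m = len(ids)
    r = [[0.0] * m for _ in range(m)]
    for area, adjacent in neighbours.items():
        i = index[area]
        r[i][i] = float(len(adjacent))
        for other in adjacent:
            r[i][index[other]] = -1.0
    return r


def car_structure(r, lam):
    """D = lam * R + (1 - lam) * I, as used in the full conditionals of Appendix A."""
    m = len(r)
    return [
        [lam * r[i][j] + (1 - lam) * (1.0 if i == j else 0.0) for j in range(m)]
        for i in range(m)
    ]


# Hypothetical four-region map (not the BC regions), given as a neighbour list.
toy = {1: [2, 3], 2: [1, 3, 4], 3: [1, 2], 4: [2]}
R = neighbourhood_matrix(toy)
D = car_structure(R, 0.5)
for row in D:
    print(row)
```

Setting `lam` to 0 gives the identity matrix, that is, the independent area effects of the non-spatial models.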


Appendix C 


Direct and model-based point estimates and CVs


Comparison of point estimates


Area ID  Direct Est.  FHM     CAR-FHM  YCM     CAR-YCM
1        0.0765       0.0793  0.0812   0.0795  0.0812
2        0.0804       0.0795  0.0793   0.0797  0.0794
3        0.0745       0.0726  0.0731   0.0725  0.0729
4        0.0893       0.0868  0.0874   0.0867  0.0873
5        0.0782       0.0739  0.0736   0.0729  0.0731
6        0.0943       0.0914  0.0927   0.0918  0.0928
7        0.0702       0.0707  0.0712   0.0711  0.0717
8        0.0858       0.0845  0.0848   0.0844  0.0849
9        0.0877       0.0763  0.0745   0.0765  0.0747
10       0.0763       0.0805  0.0799   0.0805  0.0796
11       0.0661       0.0685  0.0678   0.0679  0.0676
12       0.0717       0.0681  0.0681   0.0678  0.0677
13       0.0631       0.0687  0.0692   0.0690  0.0693
14       0.0673       0.0685  0.0680   0.0685  0.0686
15       0.0793       0.0721  0.0707   0.0728  0.0713
16       0.0657       0.0696  0.0702   0.0697  0.0704
17       0.0859       0.0778  0.0759   0.0773  0.0759
18       0.0583       0.0626  0.0633   0.0618  0.0626
19       0.0619       0.0649  0.0647   0.0653  0.0647
20       0.0877       0.0923  0.0914   0.0917  0.0908

Comparison of CVs


Area ID  Direct Est.  FHM    CAR-FHM  YCM          CAR-YCM
1        0.168        0.107  0.099    0.107        0.100
2        0.127        0.105  0.104    0.097        0.093
3        0.135        0.116  0.106    0.110        0.097
4        0.102        0.084  0.076    0.079        0.072
5        0.158        0.094  0.076    0.105        0.083
6        0.113        0.086  0.080    0.086        0.081
7        0.124        0.099  0.096    0.106        0.101
8        0.102        0.085  0.076    0.081        0.073
9        0.158        0.119  0.105    0.117        0.105
10       0.121        0.087  0.086    0.086        0.084
11       0.141        0.118  0.108    0.109        0.105
12       0.196        0.119  0.109    0.130        0.116
13       0.168        0.115  0.108    [illegible]  0.108
14       0.206        0.126  0.125    0.136        0.133
15       0.121        0.101  0.087    0.094        0.083
16       0.127        0.101  0.097    0.103        0.097
17       0.124        0.107  0.100    0.105        0.096
18       0.155        0.143  0.136    0.134        0.130
19       0.154        0.135  0.134    0.128        0.128
20       0.103        0.086  0.085    0.083        0.082


Small area estimation under transformation to linearity


Hukum Chandra and Ray Chambers ' 


Abstract 


Small area estimation based on linear mixed models can be inefficient when the underlying relationships are non-linear. In
this paper we introduce SAE techniques for variables that can be modelled linearly following a non-linear transformation. In
particular, we extend the model-based direct estimator of Chandra and Chambers (2005, 2009) to data that are consistent
with a linear mixed model in the logarithmic scale, using model calibration to define appropriate weights for use in this
estimator. Our results show that the resulting transformation-based estimator is both efficient and robust with respect to the
distribution of the random effects in the model. An application to business survey data demonstrates the satisfactory
performance of the method.


Key Words: Sample survey; Survey estimation; Business surveys; Model calibration; Skewed data; Model-based direct 
estimation; Empirical best linear unbiased prediction. 


1. Introduction 


Commonly used methods for small area estimation 
(SAE) assume that a linear mixed model can be used to 
characterize the regression relationship between the survey 
variable Y and an auxiliary variable X in the small areas of 
interest. In particular, empirical best linear unbiased
prediction (EBLUP; see Rao 2003, chapters 6-8) is
typically based on a linear mixed model assumption.
However, when the data are skewed, as is often the case in
business surveys, the relationship between Y and X may
not be linear in the original (raw) scale, but can be linear in
a transformed scale, e.g., the logarithmic (log) scale. In 
such cases we would expect estimation based on a linear 
mixed model for Y to be inefficient compared with one 
based on a similar model for a transformed version of Y. 
See Hidiroglou and Smith (2005). The use of transforma- 
tions in inference has a long history, see for example 
Carroll and Ruppert (1988, chapter 4). Recently, Chen and 
Chen (1996) and Karlberg (2000a) have investigated the 
use of a ‘transform to linearity’ approach for regression 
estimation of survey variables that behave non-linearly. 
However, to the best of our knowledge there has been no 
application of this idea in SAE, even though economic 
theory (and casual observation) suggests that regression 
relationships in business survey data are typically multi- 
plicative, and hence linear in the log scale. 

In this paper we extend the model-based direct (MBD) 
estimation ideas described in Chandra and Chambers 
(2005, 2009) to the situation where the linear mixed model 
underpinning SAE holds on the log scale, using weights 
derived via model calibration (Wu and Sitter 2001). In 
doing so, we note that our approach easily generalises to
other monotone (i.e., invertible) transformations. In
contrast, extension of the EBLUP approach to where the 
data follow a linear mixed model under transformation is 
complicated. We also relax the usual normality assumption 
for the area effects in order to examine robustness with 
respect to this assumption. 

In the following section we summarise the MBD 
approach to SAE under a linear mixed model. In section 3 
we describe an alternative to the linear mixed model for 
skewed data which reduces to the linear mixed model 
under log transformation, and in section 4 we use a model- 
based perspective to motivate model calibrated estimation 
of population quantities where the underlying variable is 
linear after suitable transformation. In section 5 we bring 
these two ideas together, introducing the concept of a fitted 
value model derived from a linear mixed model in the 
transformed scale. We then use this fitted value model to 
specify survey weights for use in an MBD estimator in 
SAE. In section 6 we present empirical results from a 
number of simulation studies that contrast the proposed 
transformation-based MBD estimator with both the 
EBLUP and the ‘usual’ MBD estimator defined by fitting 
a linear mixed model to the data as well as with an indirect 
empirical predictor based on the same transformed scale 
linear mixed model. Section 7 concludes the paper with a 
discussion of outstanding issues. 

Note that the approach taken in this article is model- 
based. Consequently all moments are evaluated with 
respect to a model for the population data. Also, all sample 
data are assumed to have been obtained via a non- 
informative sampling method, e.g., probability sampling 
with inclusion probabilities defined by known model 
covariates.


2. Model-based direct estimation for small areas


To start, we fix our notation. Let U denote a population
of size N and let $y_U$ denote the N-vector of population
values of a characteristic Y of interest. Suppose that our
primary aim is estimation of the total $t_{Uy} = \sum_U y_j$ of these
population values (or their mean $m_{Uy} = N^{-1} \sum_U y_j$). Let X
denote a p-vector of auxiliary variables that are related, in
some sense, to Y and let $x_U$ denote the corresponding
$N \times p$ matrix of population values of these variables. We
assume that the individual sample values of X are known.
The non-sample values of X may not be individually
known, but are assumed known at some aggregate level. At
a minimum, we know the vector of population totals $t_{Ux}$ of
the columns of $x_U$.

Suppose that it is reasonable to assume that the
regression of Y on X in the population is linear, i.e.,


$E(y_U \mid x_U) = x_U \beta$ and $\mathrm{Var}(y_U \mid x_U) = v_U$     (1)


where $v_U$ is known up to a multiplicative constant. Given a
sample s of size n from this population, we can partition


$y_U = (y_s', y_r')'$, $x_U = (x_s', x_r')'$


and


$v_U = \begin{pmatrix} v_{ss} & v_{sr} \\ v_{rs} & v_{rr} \end{pmatrix}$


into their sample and non-sample components. Here
$r = U - s$ denotes the population units that are not in
sample. The vector of weights that defines the Best Linear
Unbiased Predictor (BLUP) of $t_{Uy}$ is then (Royall 1976;
Valliant, Dorfman and Royall 2000, section 2.4)


$w_s^{BLUP} = (w_j^{BLUP}; \ j \in s) = 1_s + H_s'(t_{Ux} - t_{sx}) + (I_n - H_s' x_s') v_{ss}^{-1} v_{sr} 1_r$     (2)


where $H_s = (x_s' v_{ss}^{-1} x_s)^{-1} x_s' v_{ss}^{-1}$, $I_n$ is the identity matrix of
order n, $t_{sx}$ is the vector of sample totals of X and $1_s$ ($1_r$)
denotes a vector of ones of size n (N - n).
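
A special case may help to see what the weights (2) do. The sketch below assumes a single auxiliary variable (p = 1), uncorrelated units so that $v_{sr} = 0$, and a diagonal $v_{ss}$; under these assumptions the last term of (2) vanishes and the weights reduce to $w_j = 1 + H_j (t_{Ux} - t_{sx})$. The function name and the numbers are illustrative only.

```python
def blup_weights_single_x(x_s, v_s, t_ux):
    """BLUP weights (2) when p = 1, v_sr = 0 and v_ss = diag(v_s)."""
    t_sx = sum(x_s)
    # Scalar version of x_s' v_ss^{-1} x_s.
    denom = sum(xj * xj / vj for xj, vj in zip(x_s, v_s))
    # w_j = 1 + H_j * (t_Ux - t_sx), with H_j = (x_j / v_j) / denom.
    return [1.0 + (xj / vj) / denom * (t_ux - t_sx) for xj, vj in zip(x_s, v_s)]


# Hypothetical sample of three units with variance proportional to x,
# and a known population total of 60 for X.
x_s = [1.0, 2.0, 3.0]
weights = blup_weights_single_x(x_s, v_s=x_s, t_ux=60.0)
print(weights)
print(sum(w * x for w, x in zip(weights, x_s)))  # reproduces t_Ux up to rounding
```

With variance proportional to x the weights are all equal to $t_{Ux} / t_{sx}$, which gives the familiar ratio estimator, and for any choice of `v_s` the weighted sample total of X recovers the known population total.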

We now assume that the target population U of size N
can be partitioned into D non-overlapping small areas or
domains, each of size $N_i$, $i = 1, \ldots, D$, such that $N =
\sum_{i=1}^D N_i$. Given that a sample s of size n units is drawn from this
population, we shall assume that a sub-sample $s_i$ of size $n_i$
units is drawn from area i, with $n = \sum_{i=1}^D n_i$. Note that we
assume that all small areas are sampled and that there is at
least one sample unit in each small area of interest.
As noted in section 1, linear mixed models are often used
in SAE. Such models can be written in the form


$y_U = x_U \beta + g_U u + e_U$     (3)


where u is a random vector of so-called area effects, $e_U$ is
a population N-vector of random individual effects and $g_U$
is a known matrix. In general, area effects are vector-valued,
so $u = (u_1', u_2', \ldots, u_D')'$ and $g_U = \mathrm{diag}\{g_i; \ i = 1, \ldots, D\}$,
where $g_i$ is of dimension $N_i \times q$. The area specific effects
$\{u_i; \ i = 1, \ldots, D\}$ are assumed to be independent and
identically distributed realisations of a random vector of
dimension q with zero mean and covariance matrix $\Sigma_u$.
Similarly, the scalar individual effects making up $e_U$ are
assumed to be independent and identically distributed
realisations of a random variable with zero mean and
variance $\sigma_e^2$, with area and individual effects mutually
independent. The parameters $\theta = (\Sigma_u, \sigma_e^2)$ are typically
referred to as the variance components of (3).

Given the values of the variance components, it is
straightforward to see that (3) is just a special case of the
general linear model (1) that underpins the BLUP weights
(2). In particular, under (3)


$v_{ss} = \mathrm{diag}\{v_{iss}; \ i = 1, \ldots, D\} = \mathrm{diag}\{g_{is} \Sigma_u g_{is}' + \sigma_e^2 I_{n_i}; \ i = 1, \ldots, D\}$     (4)

and

$v_{sr} = \mathrm{diag}\{v_{isr}; \ i = 1, \ldots, D\} = \mathrm{diag}\{g_{is} \Sigma_u g_{ir}'; \ i = 1, \ldots, D\}$.     (5)


Here $g_{is}$ and $g_{ir}$ denote the restriction of $g_i$ to sampled
and non-sampled units in area i respectively. Given esti-
mated values $\hat{\theta} = (\hat{\Sigma}_u, \hat{\sigma}_e^2)$ of the variance components we
can substitute these in (4) and (5) to obtain estimates $\hat{V}_{ss}$
and $\hat{V}_{sr}$ of $v_{ss}$ and $v_{sr}$ respectively, and therefore compute
‘empirical’ BLUP weights, or EBLUP weights, for the
population total of Y as

OS SEs Rah Sas BD) 


EBLUP EBLUP 
Ss ma (w; 
ir: ry 
= 1, “8 H, (ty, a te) 
whet \a-la 
iG se Higxe) Veavieuls (6) 
ral 


where H, = (x/¥ 


Ss Ss 


= ne 
x,) ‘x’ ¥). Note that we now use a 


double index of ij to differentiate between population units 
in different areas. 
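The block-diagonal matrices in (4) and (5) are straightforward to assemble area by area. The sketch below assumes NumPy and that sample and non-sample units are ordered by area. The helper names `area_blocks` and `block_diagonal`, and the random-intercept example values, are hypothetical illustrations rather than anything specified in the paper.

```python
import numpy as np

def area_blocks(g_is, g_ir, sigma_u, sigma_e2):
    """Per-area blocks of (4) and (5): v_iss and v_isr."""
    v_iss = g_is @ sigma_u @ g_is.T + sigma_e2 * np.eye(g_is.shape[0])
    v_isr = g_is @ sigma_u @ g_ir.T
    return v_iss, v_isr

def block_diagonal(blocks):
    """Place a list of 2-D arrays along the diagonal of a zero matrix."""
    rows = sum(b.shape[0] for b in blocks)
    cols = sum(b.shape[1] for b in blocks)
    out = np.zeros((rows, cols))
    r = c = 0
    for b in blocks:
        out[r:r + b.shape[0], c:c + b.shape[1]] = b
        r += b.shape[0]
        c += b.shape[1]
    return out

# Hypothetical example: random intercepts (q = 1, g_ij = 1), two areas
sigma_u = np.array([[0.25]])
sigma_e2 = 1.0
sizes = [(2, 3), (3, 4)]  # (n_i, N_i - n_i) for each area
pairs = [area_blocks(np.ones((n_i, 1)), np.ones((r_i, 1)), sigma_u, sigma_e2)
         for n_i, r_i in sizes]
v_ss = block_diagonal([p[0] for p in pairs])
v_sr = block_diagonal([p[1] for p in pairs])
```

With estimated variance components substituted for `sigma_u` and `sigma_e2`, the same construction gives the estimated matrices used in (6).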

The MBD estimator for the mean $m_{iy}$ of Y in area i
(Chandra and Chambers 2005, 2009) based on the EBLUP
weights for the total (6) is simply the corresponding
weighted average of the sample values of Y in area i,

$$\hat m_{iy}^{\text{HJ-LinMBD}} = \sum_{j \in s_i} w_{ij}^{EBLUP} y_{ij} \Big/ \sum_{j \in s_i} w_{ij}^{EBLUP}. \qquad (7)$$

Note that (7) is not the EBLUP for $m_{iy}$ under (3). This is
(see Rao 2003, section 6.2.3)

$$\hat m_{iy}^{\text{HT-LinEBLUP}} = \hat E[m_{iy} \mid y_{is}, x_{is}, x_{ir}]
= N_i^{-1}\Big[\sum_{j \in s_i} y_{ij} + 1_{ir}'\{x_{ir}\hat\beta + \hat v_{irs} \hat v_{iss}^{-1}(y_{is} - x_{is}\hat\beta)\}\Big]$$

$$= N_i^{-1}\Big[\sum_{j \in s_i} y_{ij} + (N_i - n_i)\{\bar x_{ir}'\hat\beta + \bar g_{ir}' \hat\Sigma_u g_{is}'(g_{is}\hat\Sigma_u g_{is}' + \hat\sigma_e^2 I_{n_i})^{-1}(y_{is} - x_{is}\hat\beta)\}\Big]. \qquad (8)$$

Here $\hat E$ denotes the expectation operator under (3) with
unknown parameters replaced by estimates, $x_{is}$ and $x_{ir}$ are
the matrices of sample and non-sample values of X in area
i, $y_{is}$ is the vector of sample values of Y in the same area,
$\hat\beta$ is the ‘empirical’ BLUE of $\beta$, $\hat v_{irs}$ is the transpose of the
estimated value of $v_{isr}$ with $\hat v_{iss}$ the corresponding estimate
of $v_{iss}$, see (4) and (5), and $1_{ir}$ is a vector of ones of length
$N_i - n_i$. Note that the last expression on the right hand side
of (8) follows directly by substitution of (4) and (5), with
$\bar x_{ir}$ and $\bar g_{ir}$ denoting the column vectors of order p and q
defined by averaging the columns of $x_{ir}$ and $g_{ir}$ respectively.
Like the EBLUP (8), the estimator (7) is a weighted
function of all the sample values. Note that under a random
intercept specification of (3), (8) reduces to the expression
(7.2.39) in Rao (2003, section 7.2).

Mean squared error (MSE) estimation for (8) is usually
carried out using the theory described in Prasad and Rao
(1990). Although this MSE estimator is somewhat complicated,
it works well under (3). However, when (3) fails it
can be misleading. It is also inadequate as an estimator of
the repeated sampling MSE of (8), as has been pointed out
by Longford (2007). In contrast, MSE estimation for (7) is
quite straightforward. This is because if one treats the
weights defining this estimator as fixed, then it is a linear
estimator of a domain mean, and so its prediction variance
$V_i$ under (1) can be estimated using well-known methods
(see Royall and Cumberland 1978). Since in general the
EBLUP weights for the total (6) are not ‘locally calibrated’
(i.e., they do not reproduce the area i mean $\bar x_i$ of X), (7) has
a bias $B_i$ under (1). A simple plug-in estimate of this bias is
the difference between (7) and $\bar x_i'\hat\beta$. The final MSE
estimator used with (7) is therefore defined by summing the
estimate of $V_i$ and the square of this estimate of $B_i$. This
method of MSE estimation has been empirically demonstrated
to have good model-based as well as repeated
sampling properties. See Chandra and Chambers (2005,
2009), Chambers and Tzavidis (2006), Chandra, Salvati and
Chambers (2007) and Tzavidis, Salvati, Pratesi and
Chambers (2008).


3. Small area estimation under transformation 


In this section we extend the MBD approach to SAE 
when the underlying regression relationships are non-linear. 
In doing so, we shall focus on the important case where the 
population values of Y follow a non-linear model in their 
original (raw) scale, but their logarithms can be modelled 
linearly. The extension to other ‘transform to linear’ models 
is straightforward. 

Without loss of generality, suppose that both Y and X are
scalar and strictly positive, with skewed population marginal
distributions and clear evidence of non-linearity in
their relationship, e.g., as in many business survey applications.
Furthermore, suppose that a linear mixed model is appropriate
for characterising how the regression of log(Y) on log(X)
varies between the small areas. That is, for $i = 1, \ldots, D$;
$j = 1, \ldots, N_i$ we have

$$l_{ij} = \log(y_{ij}) = \beta_0 + \beta_1 \log(x_{ij}) + g_{ij}' u_i + e_{ij} \qquad (9)$$

where $y_{ij}$ and $x_{ij}$ are the values of Y and X respectively for
population unit j in small area i, $g_{ij}$ denotes a ‘contextual’
covariate of dimension q, $u_i$ denotes a random effect for
area i also of dimension q and $e_{ij}$ is a scalar individual
random effect. As usual with this type of model, we assume
that all random effects are normally distributed and mutually
uncorrelated, with zero expected values, $\mathrm{Var}(u_i) = \Sigma_u$
and $\mathrm{Var}(e_{ij}) = \sigma_e^2$. Here $\Sigma_u$ is the $q \times q$ matrix of covariances
for the random effects. Note that $\mathrm{Var}(l_{ij} \mid x_{ij}) =
v_{ijj} = g_{ij}' \Sigma_u g_{ij} + \sigma_e^2$ and $\mathrm{Cov}(l_{ij}, l_{ik} \mid x_{ij}, x_{ik}, g_{ij}, g_{ik}) =
v_{ijk} = g_{ij}' \Sigma_u g_{ik}$ for $j \ne k$ under (9).

Given sample values of $y_{ij}$, $x_{ij}$ and $g_{ij}$, standard
methods of estimation (e.g., ML or REML, see Harville
1977) can be used to estimate the parameters of (9). Let $\hat\Sigma_u$
and $\hat\sigma_e^2$ denote the resulting estimates of the variance
components of this linear mixed model. The estimate of
$\beta = (\beta_0, \beta_1)'$ is then

$$\hat\beta = \Big(\sum_i d_{is}' \hat V_{iss}^{-1} d_{is}\Big)^{-1}\Big(\sum_i d_{is}' \hat V_{iss}^{-1} l_{is}\Big) \qquad (10)$$

where $\hat V_{iss}$, $d_{is}$ and $l_{is}$ are the sample components of
$\hat V_i = [\hat v_{ijk}] = g_i \hat\Sigma_u g_i' + \hat\sigma_e^2 I_i$, $d_i = [d_{ij}'] = [1_i\ \log(x_i)]$ and
$l_i = (l_{ij};\ j = 1, \ldots, N_i)$ respectively. Here $g_i$ is the
$N_i \times q$ matrix defined by the covariates $g_{ij}$ in area i, $I_i$ is
the identity matrix of order $N_i$, $1_i$ denotes a vector of ones
of dimension $N_i$ and $\log(x_i)$ denotes the vector of $N_i$
values of log(X) in area i.
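Equation (10) is a generalised least squares estimator accumulated over areas. As a hedged illustration, the sketch below assumes NumPy and that the per-area matrices $d_{is}$, $\hat V_{iss}$ and vectors $l_{is}$ have already been formed. The function name `gls_beta` and the noise-free toy data are hypothetical.

```python
import numpy as np

def gls_beta(d_list, v_list, l_list):
    """GLS estimate (10): accumulate d_is' V_iss^{-1} d_is and d_is' V_iss^{-1} l_is."""
    p = d_list[0].shape[1]
    lhs = np.zeros((p, p))
    rhs = np.zeros(p)
    for d_is, v_iss, l_is in zip(d_list, v_list, l_list):
        v_inv_d = np.linalg.solve(v_iss, d_is)  # V_iss^{-1} d_is (V_iss symmetric)
        lhs += d_is.T @ v_inv_d
        rhs += v_inv_d.T @ l_is
    return np.linalg.solve(lhs, rhs)

# Hypothetical check: noise-free log-scale data, random-intercept covariance
rng = np.random.default_rng(1)
beta_true = np.array([1.0, 0.5])
d_list, v_list, l_list = [], [], []
for n_i in (3, 4, 5):
    d_is = np.column_stack([np.ones(n_i), np.log(rng.uniform(1.0, 50.0, n_i))])
    d_list.append(d_is)
    v_list.append(0.25 * np.ones((n_i, n_i)) + np.eye(n_i))
    l_list.append(d_is @ beta_true)
beta_hat = gls_beta(d_list, v_list, l_list)
```

With noise-free responses the estimator returns the generating coefficients exactly, which is a convenient sanity check before applying it to real data.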

Note that when the variance components $\Sigma_u$ and $\sigma_e^2$ are
known, (10) is the BLUE for $\beta$. Consequently, $E(\hat\beta) = \beta$
and $\mathrm{Var}(\hat\beta) \approx (\sum_i d_{is}' V_{iss}^{-1} d_{is})^{-1}$. Put $\hat\phi_i = (\hat\phi_{ij}) = d_i \hat\beta$. Then
$E(\hat\phi_i) \approx d_i \beta$ and $\mathrm{Var}(\hat\phi_i) = A_i = [a_{ijk}] = d_i \mathrm{Var}(\hat\beta)\, d_i'$,
where $a_{ijk} = d_{ij}' \mathrm{Var}(\hat\beta)\, d_{ik} \to 0$ as $n \to \infty$.

Our aim is to use the log scale linear mixed model (9) for
estimation of the small area means $m_{iy}$. In particular, we use
model calibration (Wu and Sitter 2001) based on this model
to develop sample weights for use in the MBD estimator (7)
of this quantity.


4. Model calibrated weighting 


Model calibration was introduced by Wu and Sitter 
(2001) as a model-assisted method of calibrated weighting 
when the underlying regression relationship is non-linear. 
Here we provide a model-based perspective on the method, 
as a precursor to using it for constructing weights for use in 
an MBD estimator in a similar situation. 

Suppose that the underlying population model is non-linear,
with the relationship between Y and X in the
population of form

$$E(y_j \mid x_j) = h(x_j; \eta) \quad \text{and} \quad \mathrm{Var}(y_j \mid x_j) = \sigma_j^2. \qquad (11)$$

Here $j = 1, \ldots, N$, $\eta$ (typically vector-valued) and $\sigma_j^2$ are
unknown model parameters and the mean function
$h(x_j; \eta)$ is a known function of $x_j$ and $\eta$. We also
assume that population units are mutually uncorrelated
given their respective values of X. Note that (11) is quite
general, and includes linear, non-linear, and generalized
linear models as special cases. In this situation, Wu and
Sitter (2001) define the model-calibrated estimator of the
population total $t_{Uy}$ as $\hat t_y^{mc} = \sum_{j \in s} w_j^{mc} y_j$, where the vector
of weights $w_s^{mc} = (w_j^{mc};\ j \in s)$ is chosen to minimise an
appropriately chosen measure of the distance from $w_s^{mc}$ to
the vector of Horvitz-Thompson weights $w_s^{\pi} = (\pi_j^{-1};\ j \in s)$
subject to the model calibration constraints

$$\sum_{j \in s} w_j^{mc} = N \quad \text{and} \quad \sum_{j \in s} w_j^{mc} h(x_j; \hat\eta_\pi) = \sum_{j \in U} h(x_j; \hat\eta_\pi) \qquad (12)$$

with $\hat\eta_\pi$ a design consistent estimator of $\eta$. Note that
unlike standard calibration, the constraints (12) require that
we know the individual population values of X. The key
idea behind this approach is that provided (11) fits
reasonably, then $y_j$ is (at least approximately) a linear
function of its fitted value $h(x_j; \hat\eta_\pi)$ under this model and
so we can carry out linear estimation using these fitted
values as auxiliary information.
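To make the constraints (12) concrete, the sketch below computes calibrated weights under one particular distance measure. The chi-squared distance $\sum_j (w_j - d_j)^2 / d_j$ with $d_j = \pi_j^{-1}$ is an assumption made here because it has a closed-form solution; the text above only requires an appropriately chosen distance. The function name and toy fitted values are hypothetical, and NumPy is assumed.

```python
import numpy as np

def model_calibrated_weights(pi_s, h_s, N, h_total):
    """Chi-squared-distance calibration to the two constraints in (12).

    pi_s: sample inclusion probabilities; h_s: fitted values h(x_j; eta_hat)
    for sample units; N: population size; h_total: population sum of fitted values.
    """
    d = 1.0 / np.asarray(pi_s, dtype=float)       # Horvitz-Thompson weights
    h_s = np.asarray(h_s, dtype=float)
    z = np.column_stack([np.ones_like(h_s), h_s])  # calibration variables (1, h)
    gap = np.array([N, h_total]) - z.T @ d         # constraint shortfall under d
    lam = np.linalg.solve(z.T @ (d[:, None] * z), gap)
    return d * (1.0 + z @ lam)

# Hypothetical example: systematic equal-probability sample of 10 from 100
N, n = 100, 10
x_U = np.linspace(1.0, 10.0, N)
h_U = 5.0 * x_U ** 1.5        # stand-in for fitted values from a non-linear model
s = np.arange(0, N, 10)
w_mc = model_calibrated_weights(np.full(n, n / N), h_U[s], N, h_U.sum())
```

Other distance measures generally need an iterative solver, but the two constraints being enforced are the same.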
A model-based perspective on model calibration can be
developed as follows. Let $\hat\eta$ denote a ‘model-efficient’
estimator of $\eta$ in (11), e.g., its maximum likelihood (ML)
estimator, with associated fitted values $h(x_j; \hat\eta)$. In general,
these fitted values will not be unbiased. They will also be
correlated. However, there will still be a systematic relationship
between the actual values of Y and their corresponding
fitted values that we can approximate. Although there is
nothing to stop us looking at more complex approximations,
a linear model for the relationship between the population
values $y_j$ and the fitted values $\hat y_j = h(x_j; \hat\eta)$ seems a
reasonable starting point. We therefore replace the non-linear
model (11) by the linear model

$$E(y_j \mid \hat y_j) = \alpha_0 + \alpha_1 \hat y_j \quad \text{and} \quad \mathrm{Cov}(y_j, y_k \mid \hat y_j, \hat y_k) = \omega_{jk}. \qquad (13)$$

We refer to (13) as the ‘fitted value’ model corresponding to
(11). Let $J_U$ denote the population ‘design matrix’ under
(13), i.e., $J_U = [1_N\ \hat y_U]$, where $1_N$ denotes the unit vector
of size N and $\hat y_U = (\hat y_j;\ j = 1, \ldots, N)$, and put $\Omega_U =
[\omega_{jk};\ j = 1, \ldots, N;\ k = 1, \ldots, N]$. We can then partition
$J_U$ and $\Omega_U$ according to sample (s) and non-sample (r)
units as

$$J_U = \begin{pmatrix} J_s \\ J_r \end{pmatrix} \quad \text{and} \quad \Omega_U = \begin{pmatrix} \Omega_{ss} & \Omega_{sr} \\ \Omega_{rs} & \Omega_{rr} \end{pmatrix}$$

and hence write down the weights that define the BLUP of
$t_{Uy}$ under (13). These are the model-based model-calibrated
weights

$$w_s^{mbmc} = (w_j^{mbmc};\ j \in s) = 1_n + H_{mc}'(J_U' 1_N - J_s' 1_n) + (I_n - H_{mc}' J_s')\, \Omega_{ss}^{-1} \Omega_{sr} 1_{N-n} \qquad (14)$$

where $H_{mc} = (J_s' \Omega_{ss}^{-1} J_s)^{-1} J_s' \Omega_{ss}^{-1}$. Note that these weights are
model-calibrated since $\sum_{j \in s} w_j^{mbmc} = N$ and $\sum_{j \in s} w_j^{mbmc} \hat y_j =
\sum_{j \in U} \hat y_j$. However, unlike the linear model EBLUP weights
(2), they are not calibrated on X. In practice, the components
of $\Omega_U$ will not be known and will need to be
estimated. When these estimates are substituted in (14),
we obtain the empirical version $w_s^{embmc}$ of these
model-calibrated weights.

5. Model calibrated weighting for
small area estimation


We now use model calibration based on the log scale
linear mixed model (9) to obtain sample weights for use in
the MBD estimator (7). From the development in the
previous section it can be seen that this requires us to first
specify a fitted value model (13) for Y based on (9), i.e., we
need to calculate appropriate fitted values $\hat y_{ij}$ as well as
estimates $\hat\omega_{ijk}$ of $\omega_{ijk} = \mathrm{Cov}(y_{ij}, y_{ik} \mid x_{ij}, x_{ik}, g_{ij}, g_{ik})$ under
(9). The sample weights to use in the MBD estimator (7) are
then given by (14).

A simple method of defining fitted values $\hat y_{ij}$ under (9) is
one where parameter estimates derived under this model are
used to obtain predicted values on the log scale which are
then back-transformed. Unfortunately, as is well known, this
approach is biased. We therefore develop the first and
second order moments of an appropriate bias-corrected
fitted value model based on (9). Let $x_s$ and $g_s$ denote the
sample values of $x_{ij}$ and $g_{ij}$ respectively. Under (9),

$$E(y_{ij} \mid x_{ij}, g_{ij}) = E\{e^{l_{ij}} \mid x_{ij}, g_{ij}\} = e^{\phi_{ij} + v_{ijj}/2} \ne E(e^{\hat\phi_{ij} + \hat v_{ijj}/2} \mid x_s, g_s)$$

so the usual bias correction that makes use of the fact that
the conditional distribution of $y_{ij}$ is lognormal is inadequate.
Let $\hat\eta_{ij} = (\hat\phi_{ij}, \hat v_{ijj})'$ be an estimate of $\eta_{ij} = (\phi_{ij}, v_{ijj})'$
such that $E(\hat\eta_{ij} - \eta_{ij}) \approx 0$ for large n. Put $z(\eta_{ij}) =
e^{\phi_{ij} + v_{ijj}/2}$. Using a second order Taylor series approximation
we can write

$$z(\hat\eta_{ij}) \approx z(\eta_{ij}) + z^{(1)}(\eta_{ij})'(\hat\eta_{ij} - \eta_{ij}) + \tfrac{1}{2}(\hat\eta_{ij} - \eta_{ij})' z^{(2)}(\eta_{ij})(\hat\eta_{ij} - \eta_{ij})$$

and so

$$E\{z(\hat\eta_{ij})\} \approx z(\eta_{ij}) + \tfrac{1}{2}\,\mathrm{tr}\big[E\{z^{(2)}(\eta_{ij})(\hat\eta_{ij} - \eta_{ij})(\hat\eta_{ij} - \eta_{ij})'\}\big].$$

Here

$$z^{(1)}(\eta_{ij}) = \begin{pmatrix} e^{\phi_{ij} + v_{ijj}/2} \\ \tfrac{1}{2} e^{\phi_{ij} + v_{ijj}/2} \end{pmatrix}
\quad \text{and} \quad
z^{(2)}(\eta_{ij}) = \begin{pmatrix} e^{\phi_{ij} + v_{ijj}/2} & \tfrac{1}{2} e^{\phi_{ij} + v_{ijj}/2} \\ \tfrac{1}{2} e^{\phi_{ij} + v_{ijj}/2} & \tfrac{1}{4} e^{\phi_{ij} + v_{ijj}/2} \end{pmatrix}$$

are the vector and matrix respectively containing the first
and second order derivatives of $z(\eta_{ij})$ with respect to $\eta_{ij}$.
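Since the asymptotic covariance between ML (or REML)
estimators of the fixed and variance components of a linear
mixed model is zero (McCulloch and Searle 2001, chapter
2, pages 40 - 45), the covariance between $\hat\phi_{ij}$ and $\hat v_{ijj}$ will be
negligible. It follows that

$$\mathrm{tr}\big[E\{z^{(2)}(\eta_{ij})(\hat\eta_{ij} - \eta_{ij})(\hat\eta_{ij} - \eta_{ij})'\}\big]
= \mathrm{tr}\big[z^{(2)}(\eta_{ij})\, E\{(\hat\eta_{ij} - \eta_{ij})(\hat\eta_{ij} - \eta_{ij})'\}\big]$$

$$\approx e^{\phi_{ij} + v_{ijj}/2}\big[\hat a_{ijj} + \tfrac{1}{4}\hat V(\hat v_{ijj})\big]
= E(y_{ij} \mid x_{ij}, g_{ij})\big[\hat a_{ijj} + \tfrac{1}{4}\hat V(\hat v_{ijj})\big]$$

where $\hat a_{ijj} = d_{ij}' \hat V(\hat\beta)\, d_{ij}$ and $\hat V(\hat\beta) = (\sum_i d_{is}' \hat V_{iss}^{-1} d_{is})^{-1}$ is the
usual estimator of $\mathrm{Var}(\hat\beta)$. Our fitted values are therefore
defined by the second order bias corrected estimator of
$E(y_{ij} \mid x_{ij}, g_{ij})$,

$$\hat y_{ij} = h(d_{ij}; \hat\eta) = \hat k_{ij}^{-1} e^{\hat\phi_{ij} + \hat v_{ijj}/2} \qquad (15)$$

where

$$\hat k_{ij} = 1 + \tfrac{1}{2}\big[\hat a_{ijj} + \tfrac{1}{4}\hat V(\hat v_{ijj})\big]$$

and $\hat V(\hat v_{ijj})$ is the estimated asymptotic variance of $\hat v_{ijj}$.
Under ML and REML estimation of the variance components
of (9), this estimated asymptotic variance is obtained
from the inverse of the relevant information matrix. Note
that the bias adjustment of Karlberg (2000a) is a special case
of (15).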
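The bias-corrected back-transformation can be illustrated numerically. The sketch below assumes the correction is applied by dividing the naive back-transformed value by the second order factor implied by the Taylor expansion above, $1 + \tfrac{1}{2}[\hat a_{ijj} + \tfrac{1}{4}\hat V(\hat v_{ijj})]$. The function name `bias_corrected_fit` and the input values are hypothetical.

```python
import math

def bias_corrected_fit(phi_hat, v_hat, a_hat, var_v_hat):
    """Second order bias-corrected fitted value on the raw scale.

    phi_hat: estimated log-scale mean; v_hat: estimated log-scale variance;
    a_hat: estimated variance of phi_hat; var_v_hat: estimated variance of v_hat.
    """
    naive = math.exp(phi_hat + 0.5 * v_hat)              # plug-in lognormal mean
    correction = 1.0 + 0.5 * (a_hat + 0.25 * var_v_hat)  # second order factor
    return naive / correction

# Hypothetical inputs
naive_value = math.exp(1.0 + 0.5 * 0.5)
corrected_value = bias_corrected_fit(1.0, 0.5, 0.02, 0.04)
```

When the estimation variances `a_hat` and `var_v_hat` are zero the correction factor is one and the naive lognormal mean is returned unchanged; as they grow, the fitted value is shrunk.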

In order to use (14) to define model-based model-calibrated
sample weights, we also need estimates of the
second order moments of the population values of Y given
these fitted values. The conditional moments $\omega_{ijk}$ are a first
order approximation to these moments. In particular, given
normal random effects

$$\omega_{ijk} = e^{\phi_{ij} + \phi_{ik} + (v_{ijj} + v_{ikk})/2}\,(e^{v_{ijk}} - 1). \qquad (16)$$

Our estimate $\hat\omega_{ijk}$ of $\omega_{ijk}$ is obtained by substituting $\hat\phi_{ij}$ and
$\hat v_{ijk}$ for $\phi_{ij}$ and $v_{ijk}$ in (16).
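Expression (16) is the standard covariance of two jointly lognormal variables. A small sketch of it as a function follows; the name `lognormal_cov` and the example values are hypothetical.

```python
import math

def lognormal_cov(phi_j, phi_k, v_jj, v_kk, v_jk):
    """Covariance (16) of exp(l_j) and exp(l_k) for jointly normal (l_j, l_k)
    with means phi_j, phi_k, variances v_jj, v_kk and covariance v_jk."""
    return math.exp(phi_j + phi_k + 0.5 * (v_jj + v_kk)) * (math.exp(v_jk) - 1.0)

# Hypothetical values: same unit (j = k) and two uncorrelated units
var_same_unit = lognormal_cov(1.0, 1.0, 0.5, 0.5, 0.5)
cov_uncorrelated = lognormal_cov(1.0, 2.0, 0.5, 0.7, 0.0)
```

For j = k the expression reduces to the familiar lognormal variance $(e^{v} - 1)e^{2\phi + v}$, and it vanishes when the log-scale covariance is zero.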

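The empirical model-based model-calibrated weights
(14) corresponding to the fitted value model defined by (15)
and (16) are

$$w_s^{embmc} = (w_{ij}^{embmc};\ j \in s_i;\ i = 1, \ldots, D) = 1_n + \hat H_{mc}'(\hat J_U' 1_N - \hat J_s' 1_n) + (I_n - \hat H_{mc}' \hat J_s')\, \hat\Omega_{ss}^{-1} \hat\Omega_{sr} 1_{N-n}. \qquad (17)$$

Here $\hat J_U = [1_N\ \hat y_U]$, so

$$\hat J_U' 1_N - \hat J_s' 1_n = \begin{pmatrix} N - n \\ \sum_i \sum_{j \in r_i} \hat y_{ij} \end{pmatrix}$$

and $\hat H_{mc} = (\hat J_s' \hat\Omega_{ss}^{-1} \hat J_s)^{-1} \hat J_s' \hat\Omega_{ss}^{-1}$. Also $\hat\Omega_{ss} = \mathrm{diag}\{\hat\Omega_{iss};\
i = 1, \ldots, D\}$ and $\hat\Omega_{sr} = \mathrm{diag}\{\hat\Omega_{isr};\ i = 1, \ldots, D\}$, where $\hat\Omega_{iss}$
and $\hat\Omega_{isr}$ are defined by the sample/non-sample
decomposition of $\hat\Omega_i = [\hat\omega_{ijk}]$. For example, when (9) corresponds
to a random intercepts specification, $\hat v_{ijk} = \hat\sigma_u^2 + \hat\sigma_e^2 I(j = k)$
and so the components of $\hat\Omega_i$ are

$$\hat\omega_{ijk} = e^{\hat\phi_{ij} + \hat\phi_{ik} + \hat\sigma_u^2 + \hat\sigma_e^2}\big[e^{\hat\sigma_u^2}\{1 + I(j = k)(e^{\hat\sigma_e^2} - 1)\} - 1\big].$$
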

The development so far has assumed normality of log- 
scale random effects. However, there is no good reason 
(beyond convenience) to assume that with skewed data 
these random area effects should be normal. One alternative, 
given a scalar area effect in (9), is to assume that the random 
effects in this model are drawn from the gamma family of 
distributions. From the properties of this distribution and 
using binomial and exponential expansions (ignoring higher 
order terms) we can show that $E(y_{ij} \mid x_{ij}, g_{ij}) \approx e^{\phi_{ij} + v_{ijj}/2} =
z(\eta_{ij})$, as in the normal case. This indicates that an MBD
estimator based on the model-based model-calibrated 
weights (17) should be robust with respect to the distribu- 
tion of the random effects in (9). 

Finally, we consider definition of the MBD estimator 
itself. As noted in section 2, this estimator is just the 
weighted average of the sample Y-values in an area. 
However, use of such a weighted average pre-supposes that 
the weights are reasonably close to being ‘locally calibrated 
on N’, i.e., when summed over the sample units in small
area i we obtain a value that is not too different from the
actual small area population size $N_i$. This property usually
holds if the weights are the EBLUP weights for the total (6) 
defined by a linear mixed model for Y. It does not 
necessarily hold for the model-based model-calibrated 
weights (17). Consequently, we consider two specifications 
for the MBD estimator given these weights. The first, which 
we refer to as a ‘Hajek specification’, is just the weighted 
average (7), with weights defined by (17). The second, 
which we refer to as a ‘Horvitz-Thompson specification’, 
replaces the denominator in (7) by the actual value of N,. 
That is, the two types of MBD estimator under model-based
model-calibrated weighting that we consider are

$$\hat m_{iy}^{\text{HJ-TrMBD}} = \sum_{j \in s_i} w_{ij}^{embmc} y_{ij} \Big/ \sum_{j \in s_i} w_{ij}^{embmc} \qquad (18)$$

and

$$\hat m_{iy}^{\text{HT-TrMBD}} = N_i^{-1} \sum_{j \in s_i} w_{ij}^{embmc} y_{ij}. \qquad (19)$$
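The two specifications (18) and (19) differ only in the denominator, which matters when the weights in an area do not sum to roughly $N_i$. The following sketch, with hypothetical function names and made-up numbers, shows the two agreeing when the weights are locally calibrated on N and diverging when they are not.

```python
def hajek_mean(weights, y):
    """Hajek specification (18): weighted sum divided by the sum of weights."""
    return sum(w * v for w, v in zip(weights, y)) / sum(weights)

def ht_mean(weights, y, area_size):
    """Horvitz-Thompson specification (19): weighted sum divided by N_i."""
    return sum(w * v for w, v in zip(weights, y)) / area_size

# Hypothetical area with N_i = 30 and three sampled units
y_area = [100.0, 200.0, 300.0]
calibrated = [10.0, 12.0, 8.0]   # weights sum to N_i
uncalibrated = [4.0, 5.0, 3.0]   # weights sum to 12, far from N_i

hj_cal = hajek_mean(calibrated, y_area)
ht_cal = ht_mean(calibrated, y_area, 30)
hj_unc = hajek_mean(uncalibrated, y_area)
ht_unc = ht_mean(uncalibrated, y_area, 30)
```

In this toy case the first pair coincide, while the second pair differ by the ratio of the weight sum to $N_i$.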
Alternatively we can adopt a prediction-based approach
to obtain an alternative indirect predictor for the small area
mean under the log-transformed model (9). Our approach
extends that of Karlberg (2000a). In this case, assuming
model (9) holds, we predict each nonsample Y in small area
i and then sum these predictions. Note that we need to
correct for bias following back-transformation to the raw
scale when calculating these predicted values for the
nonsample Y. Under model (9), the resulting empirical
predictor for the mean $m_{iy}$ of Y in area i (denoted TrEP) can
be defined as

$$\hat m_{iy}^{\text{TrEP}} = N_i^{-1}\Big[\sum_{j \in s_i} y_{ij} + \sum_{j \in r_i} \hat y_{ij}\Big] \qquad (20)$$

where $\hat y_{ij}$ is given by (15).

Estimation of the MSE of (18) and (19) is carried out in 
the usual way for MBD estimators, i.e., via the MSE
estimation approach described in section 2. Estimation of 
the MSE of (20) is not straightforward since this predictor is 
a non-linear function of Y values. We do not pursue this 
issue in this paper. 


6. An empirical evaluation 


In this section we provide empirical results on the 
comparative performances of five different methods of 
SAE. These are the two ‘transformation-based’ MBD esti- 
mators (18) and (19), both based on the model-based model- 
calibrated weights (17) and denoted by HJ-TrMBD and HT-
TrMBD respectively; the log-transformation based predictor
(20) under model (9), denoted TrEP; the ‘standard’ MBD
estimator (7) based on the linear mixed model (3) and the 
empirical EBLUP weights for the total (6), which we 
denote by HJ-LinMBD to emphasise that it is a Hajek-type 
weighted mean based on weights derived under a linear 
mixed model; and the EBLUP (8) derived under the same 
linear mixed model, which we denote HT-LinEBLUP. Note 
that the MSEs for all three MBD estimators were estimated 
using the method described in section 2, while the MSE of 
HT-LinEBLUP was estimated using the method described 
in Prasad and Rao (1990). Note that we have not considered 
estimation of the MSE of TrEP. 

Our empirical results are based on two types of simu- 
lation studies. The first type used model-based simulation 
to generate artificial population and sample data. That is, at 
each simulation population data were first generated under 
the model and a single sample was then taken from this 
simulated population by stratified simple random sampling 
without replacement with small area as strata. These data 
were then used to compare the performances of the 
different estimators. In section 6.1 we present the results 
from these model-based simulations. We carried out two 
sets of model-based simulations. In the first set of simulations
(Set A), we investigated the performance of these
estimators given population data generated using the log-scale
linear mixed model (9). In the second set of simulations
(Set B), we examined the robustness of these estimators to 
misspecification of this model. The second type of simu- 
lation study was design-based. In section 6.2 we describe 
design-based simulations. Here we evaluated these esti- 
mators in the context of repeated sampling from a real 
population using realistic sampling methods. That is, real 
survey data were first used to simulate a population, and 
this fixed population was then repeatedly sampled ac- 
cording to a pre-specified design. In particular, the sample 
design used was stratified random sampling with strata 
corresponding to the small areas of interest and with 
stratum allocations set to the small area sample sizes in the 
original datasets. 

Four measures of estimator performance were computed 
using the various estimates generated in these simulation 
studies. They were the relative bias (RB) and the relative 
root mean squared error (RRMSE) of these estimates, 
together with the coverage rate and average width of the 
nominal 95 per cent confidence intervals based on them. In 
Tables 2 to 4 these measures are presented as averages over 
the small areas of interest. 
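The text does not spell out the formulas for these four measures, so the sketch below uses common definitions as an assumption: relative bias and relative RMSE are scaled by the mean true value over replicates and expressed in percent, and intervals are taken as the estimate plus or minus 1.96 times the square root of the estimated MSE. The function name and the two-replicate example are hypothetical.

```python
import math

def performance(estimates, truths, mse_estimates, z=1.96):
    """Per-area performance over simulation replicates (assumed definitions)."""
    k = len(estimates)
    mean_true = sum(truths) / k
    errors = [e - t for e, t in zip(estimates, truths)]
    rb = 100.0 * (sum(errors) / k) / mean_true
    rrmse = 100.0 * math.sqrt(sum(err ** 2 for err in errors) / k) / mean_true
    half_widths = [z * math.sqrt(m) for m in mse_estimates]
    covered = [abs(err) <= h for err, h in zip(errors, half_widths)]
    return {
        "RB": rb,                            # relative bias, percent
        "RRMSE": rrmse,                      # relative root MSE, percent
        "coverage": sum(covered) / k,        # share of intervals covering truth
        "width": 2.0 * sum(half_widths) / k, # average interval width
    }

# Hypothetical two-replicate example for one area
result = performance([10.0, 12.0], [10.0, 10.0], [1.0, 0.25])
```

Averaging each entry over areas would then give summary figures of the kind reported in the tables, under these assumed definitions.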


6.1 The model-based simulation study 


Model-based simulations are a common way of illus- 
trating the sensitivity of an estimation procedure to variation 
in assumptions about the structure of the population of 
interest. Here we fixed the population size at N = 15,000
and randomly generated the small area population sizes
$N_i$, $i = 1, \ldots, D = 30$, so that $\sum_i N_i = N$. We used an
overall sample size of n = 600 with small area sample
sizes set so that they were proportional to the corresponding
small area population sizes. These area-specific population
and sample sizes were kept fixed in all our simulations. The
population and sample sizes are given in Table 1a.


Table 1a
Area specific population ($N_i$) and sample ($n_i$) sizes for model-based simulation


Area 1 2 3 4 5 6 7 8 9 10
N. SEY SSi3 SOY AS Se A SIG CY Spey silts 
Nj “ub Dil DI BO MWD Bik. Mi? eri (sy ike wl 


Area 11 12 13 14 15 16 17 18 19 20
N; 502 524 509 484 487 459 542 498 512 500 
In Set A of our model-based simulations the population
values $y_{ij}$ were generated using the multiplicative model
$y_{ij} = 5.0\, x_{ij}^{\beta} u_i e_{ij}$ ($j = 1, \ldots, N_i$; $i = 1, \ldots, 30$), with random
samples then taken from each small area. Here the values of
$x_{ij}$ were independently drawn from the log-normal distribution
$\log(x_{ij}) \sim N(6, \sigma_x^2)$, with the individual effects and
area effects independently drawn as $\log(e_{ij}) \sim N(0, \sigma_e^2)$
and $\log(u_i) \sim N(0, \sigma_u^2)$ respectively. The population
values of x were re-generated in each simulation. In
particular, in each simulation we first generated the values of
x for a population of size N and then randomly assigned
these values to different areas of sizes $N_i$. The values of $\sigma_u$
and $\sigma_e$ were chosen so that the intra-area correlation in the
population varied between 0.20 and 0.25. Table 1b shows the
six different sets of parameter values that were used in Set A.
These ensured that the simulated populations contained a
wide range of variation. For each generated population and
for each area i we selected a simple random sample (without
replacement) of size $n_i$, leading to an overall sample size
of n = 600. The sample values of y and the population
values of x obtained in each simulation were then used to
estimate the small area means. That is, using the sample data
in each case, parameter values were estimated using the lme
function in R (Bates and Pinheiro 1998), and estimates for
the small area means then calculated, along with appropriate
nominal 95% confidence intervals. The process of generating
population and sample data, estimation of parameters and
calculation of small area estimates was independently replicated
1,000 times. The results from this part of the simulation
study are shown in Table 2.
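One replicate of the Set A data generation can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the model is taken to be $y_{ij} = 5.0\, x_{ij}^{\beta} u_i e_{ij}$, the $\sigma$ arguments are treated as standard deviations on the log scale, and the area and sample sizes are small made-up values rather than those of Table 1a. Function names are hypothetical.

```python
import math
import random

def generate_set_a(area_sizes, beta, sigma_u, sigma_e, sigma_x, rng):
    """Return one population: a list of areas, each a list of (x, y) pairs."""
    population = []
    for area_size in area_sizes:
        u_i = math.exp(rng.gauss(0.0, sigma_u))       # area effect, log-normal
        area = []
        for _ in range(area_size):
            x = math.exp(rng.gauss(6.0, sigma_x))     # covariate, log-normal
            e = math.exp(rng.gauss(0.0, sigma_e))     # individual effect
            area.append((x, 5.0 * x ** beta * u_i * e))
        population.append(area)
    return population

def srswor_by_area(population, sample_sizes, rng):
    """Simple random sample without replacement within each area."""
    return [rng.sample(area, n_i) for area, n_i in zip(population, sample_sizes)]

# Hypothetical small example using the first row of Table 1b
rng = random.Random(1)
pop = generate_set_a([40, 60], beta=0.5, sigma_u=0.30, sigma_e=0.50,
                     sigma_x=3.00, rng=rng)
samples = srswor_by_area(pop, [4, 6], rng)
```

Drawing x independently within each area is equivalent to generating all N values first and assigning them to areas at random, as described above.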


Table 1b
Population specifications for model-based simulation Set A

Parameter Set $\beta$ $\sigma_u$ $\sigma_e$ $\sigma_x$
1 0.5 0.30 0.50 3.00
2 0.8 0.35 0.60 2.50
3 1.0 0.40 0.70 25 
4 IES 0.45 0.80 leis 
5 RS 0.50 0.90 1.50 
6 2.0 0.60 1.00 1.20 


In Set B of the model-based simulations, population data
were generated using the model $y_{ij} = 5.0\, x_{ij} [\exp(\log^2(x_{ij}))]^{\gamma} u_i e_{ij}$.
Here the individual effects $e_{ij}$ and the area
effects $u_i$ were independently drawn as $\log(e_{ij}) \sim N(0, 1)$
and $\log(u_i) \sim N(0, 0.25)$ respectively, while the covariate
values $x_{ij}$ were drawn as $\log(x_{ij}) \sim N(3, 0.04)$. Five
different values for the parameter $\gamma$ (-1.0, -0.5, 0.0, 0.5, 1.0)
were investigated, thus generating population data with
different degrees of curvature. All other aspects of these
simulations, including the estimators considered, were the
same as in Set A. Table 3 presents results from this
component of the simulation study.
Table 2
Average relative bias (ARB), average relative RMSE (ARRMSE), average coverage rate (ACR) and average interval width (AW) for model-based simulation Set A

Criterion  Estimator     Parameter Set
                         1        2        3         4          5          6
ARB, %     HJ-TrMBD      -82.68   -95.02   -98.08    -98.50     -98.29     -99.00
           HT-TrMBD      0.09     0.10     -0.14     (25        -0.03      0.04
           TrEP          0.08     0.09     20:18     -0.48      -0.05      0.01
           HJ-LinMBD     12.01    4.09     apes      -5.54      -6.60      -9.88
           HT-LinEBLUP   13.39    5.18     -0.67     -5.24      -6.41      -9.67
ARRMSE     HJ-TrMBD      4.80     1.39     1.25      1.44       1.42       1.62
           HT-TrMBD      0.15     0.26     0.45      0.64       0.66       0.91
           TrEP          0.30     0.41     0.58      0.80       0.81       1.09
           HJ-LinMBD     ia       1.41     1.85      1.99       2.06       2.69
           HT-LinEBLUP   0.79     0.54     0.64      0.92       0.93       1.31
ACR        HJ-TrMBD      0.99     0.98     0.97      0.95       0.94       0.92
           HT-TrMBD      0.94     0.91     0.89      0.89       0.89       0.88
           HJ-LinMBD     0.87     0.85     0.85      0.88       0.88       0.87
           HT-LinEBLUP   0.85     0.85     0.86      0.87       0.88       0.87
AW         HJ-TrMBD      1,592    22,688   140,452   52 x10^4   35 x10°    44 x10°
           HT-TrMBD      219      4,414    34,105    14 x10^4   11 x10°    15 x10°
           HJ-LinMBD     1,005    19,232   139,420   57 x10^4   41 x10°    56 x10°
           HT-LinEBLUP   382      7,099    57,039    26 x10^4   VvseliOn   32 x10°

Table 3
Average relative bias (ARB), average relative RMSE (ARRMSE), average coverage rate (ACR) and average interval width (AW) for model-based simulation Set B

Criterion  Estimator     $\gamma$ = -1.0   $\gamma$ = -0.5   $\gamma$ = 0.0   $\gamma$ = 0.5   $\gamma$ = 1.0
ARB, %     HT-TrMBD      4.92              0.66              0.14             -1.50            -8.75
           HJ-LinMBD     -0.21             0.04              0.12             0.16             -0.85
           HT-LinEBLUP   -0.19             0.04              0.13             0.17             -0.77
ARRMSE     HT-TrMBD      0.38              0.35              0.33             0.37             0.41
           HJ-LinMBD     0.56              0.36              0.34             0.53             1.20
           HT-LinEBLUP   0.3               0.30              0.29             0.36             0.56
ACR        HT-TrMBD      0.94              0.92              0.92             0.91             0.87
           HJ-LinMBD     0.91              0.92              0.92             0.92             0.90
           HT-LinEBLUP   0.93              0.94              0.94             0.93             0.92
AW         HT-TrMBD      0.04              2.50              AM               29,070           StU:
           HJ-LinMBD     0.06              2.70              214              38,660           13 x10°
           HT-LinEBLUP   0.05              2.60              214              33,442           10 x10°


6.2 The design-based simulation study 


This study used the same population and samples as the 
simulation studies described in Chandra and Chambers 
(2005) and Chambers and Tzavidis (2006), which was 
based on data obtained from a sample of 1,652 farms that 
participated in the Australian Agricultural and Grazing 
Industries Survey (AAGIS). A realistic population of 81,982 
farms was defined by sampling with replacement from the 
original sample of 1,652 farms with probabilities propor- 
tional to their sample weights, all of which were strictly 
greater than one. A total of 1,000 independent samples, each
of size n = 1,652, were drawn from this fixed population by 
simple random sampling without replacement within strata 
defined by the 29 Australian agricultural regions represented 
in the AAGIS sample. These regions are the small areas of 
interest. Regional sample sizes were fixed to be the same as 
in this original sample, varying from a low of 6 to a high of 
117, which allows an evaluation of the performance of the 
different estimation methods across a range of realistic small 
area sample sizes. Note that sampling fractions in these 
strata also varied disproportionately, ranging between 0.70 
and 15.87 percent. The aim is to estimate average annual
farm costs (TCC, measured in A$) in each region using 
farm size (hectares) as the auxiliary variable. The same 
mixed model specification as in Chandra and Chambers 
(2005) is used. This includes an interaction term (zone by 
size) in the fixed effects and a random slope specification 
for the area effect. In its linear form the model does not fit 
the AAGIS sample data terribly well. This fit is improved 
(albeit marginally) when a log-scale linear specification is 
used. Our results are summarized in Table 4. 
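The construction of the fixed population and the repeated stratified sampling can be sketched with the standard library alone. This is a hedged illustration of the general procedure described above (with-replacement draws proportional to sample weight, then simple random sampling without replacement within regions); the record layout, function names and all numbers are hypothetical and far smaller than the AAGIS data.

```python
import random
from collections import defaultdict

def build_population(records, pop_size, rng):
    """Draw pop_size units with replacement, probability proportional to weight."""
    weights = [r["weight"] for r in records]
    return rng.choices(records, weights=weights, k=pop_size)

def stratified_srswor(population, sizes, rng):
    """Within each region, draw the fixed number of units without replacement."""
    by_region = defaultdict(list)
    for unit in population:
        by_region[unit["region"]].append(unit)
    return {region: rng.sample(by_region[region], n_r)
            for region, n_r in sizes.items()}

# Hypothetical original sample: 9 farms in 3 regions, all weights above one
rng = random.Random(2024)
records = [{"region": r, "weight": w, "cost": c}
           for r, w, c in [("A", 20.0, 1.2e5), ("A", 35.0, 8.0e4), ("A", 50.0, 3.1e5),
                           ("B", 40.0, 2.2e5), ("B", 25.0, 9.5e4), ("B", 30.0, 1.7e5),
                           ("C", 60.0, 4.0e5), ("C", 45.0, 6.0e4), ("C", 15.0, 1.1e5)]]
population = build_population(records, 300, rng)
sample = stratified_srswor(population, {"A": 2, "B": 3, "C": 4}, rng)
```

Repeating the last line many times against the same `population` mimics the design-based replication, with regional sample sizes held fixed across draws.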


6.3 Discussion of simulation results 


The most striking feature of Table 2 is the extremely
large values of the average relative bias of HJ-TrMBD
under model-based model-calibrated weighting. The two
best performers with respect to relative bias are HT- 
TrMBD, which is based on the same weights as HJ- 
TrMBD, and TrEP. An investigation of the reason for the 
poor performance of HJ-TrMBD revealed that summing 
the model-based model-calibrated weights (17) within small 
areas produced extremely variable estimates of the small 
area population sizes, implying that these weights cannot be 
considered as ‘multipurpose’ — they function well when 
used with variables that are reasonably correlated with the 
variable that defines the fitted value model, but can fail with 
other, less well correlated, variables (e.g., the indicator 
variable for small area inclusion). We further note that this 
problem does not arise with the ‘standard’ empirical 
EBLUP weights for the total (6), as HJ-LinMBD performs 
consistently for all six of the scenarios explored in Set A of 
the simulation study. From now on we therefore focus our 
discussion on the four estimators, HT-TrMBD, TrEP, HJ- 
LinMBD and HT-LinEBLUP. 


Table 2 shows that the average relative biases and the
average relative RMSEs for HT-TrMBD are consistently
lower than those generated by HJ-LinMBD and HT-LinEBLUP.
The average relative biases of HT-TrMBD and
TrEP are comparable. However, the average relative
RMSEs of HT-TrMBD are consistently smaller than those of
TrEP. Furthermore, average coverage rates and interval
widths for HT-TrMBD are better than those generated by
HJ-LinMBD and HT-LinEBLUP. In comparison, for the
same order of relative bias, the relative RMSEs of HT-LinEBLUP
are smaller than those of HJ-LinMBD, and,
although both estimators generate very similar coverage
rates, confidence intervals generated via HT-LinEBLUP
tend to have smaller average widths than those generated via
HJ-LinMBD.

The plots in Figure 1 display the region-specific performance
measures generated by these four estimators for the
Set A simulations. These show that the relative bias and the
relative RMSE values generated by HT-TrMBD are smaller
than corresponding values for HJ-LinMBD and HT-LinEBLUP
in all regions. With almost identical values of
relative bias, HT-TrMBD has smaller values of
relative RMSE than TrEP in all
regions. Further, the relative bias and the relative RMSE of
HJ-LinMBD and HT-LinEBLUP increase as the non-linearity
in the data increases (i.e., as we move from parameter
set 1 to parameter set 6). We also see that HT-TrMBD
generates better coverage rates across all regions
compared with the coverage rates generated by HT-LinEBLUP
and HJ-LinMBD.


Table 4
Average relative bias (ARB), average relative RMSE (ARRMSE) and average coverage rate (ACR) for design-based simulation using
AAGIS data. Simulation standard errors of ARB and ARRMSE are shown in parentheses

Criterion    Estimator     Average of 29 regions   Average of 28 regions
ARB, %       HT-TrMBD      1.96 (0.20)             1.92 (0.11)
             HJ-LinMBD     Dal 3a(Os))             -2.21 (0.12)
             HT-LinEBLUP   2.98 (0.18)             3.36 (0.16)
             PseudoEBLUP   4.01 (0.22)             4.41 (0.20)
             JL            1.89 (0.19)             P23 OniT)
ARRMSE, %    HT-TrMBD      21.93 (4.47)            17.41 (1.18)
             HJ-LinMBD     20.15 (3.80)            16.91 (2.20)
             HT-LinEBLUP   19.87 (1.78)            19.30 (1.63)
             PseudoEBLUP   22.42 (2.52)            21.95 (2.46)
             JL            20.97 (1.48)            20.48 (1.31)
ACR          HT-TrMBD      0.89                    0.92
             HJ-LinMBD     0.93                    0.95
             HT-LinEBLUP   0.85                    0.85
Overall, these results show that when the model for the
underlying population is non-linear there can be significant 
gains from the use of HT-type MBD estimators for small 
area means based on the model-calibrated weights (17) 
compared with standard linear mixed model-based esti- 
mators like HJ-LinMBD and HT-LinEBLUP. They also 
show that the indirect estimator HT-LinEBLUP performs 
relatively better than the direct estimator HJ-LinMBD in 
these situations. The indirect predictor TrEP based on log- 
transformed model (9) performs well in terms of relative 
bias but is less efficient than the MBD estimator under the 
same model. 

In Set B of the model-based simulations we investigated 
the robustness of model-based model-calibrated direct 
estimation to misspecification of the non-linear model. The 
results in Table 3 show that in this case the biases generated 
by HT-TrMBD increase as the actual non-linear model 
deviates more from the assumed non-linear model (y = 0.0 
in the table). However, these biases are offset by small 
variability, so in terms of average relative RMSE, HT-TrMBD
still performs as well as or better than HT-LinEBLUP
and continues to dominate HJ-LinMBD. The biases
generated by HJ-LinMBD and HT-LinEBLUP are of the 
same order, while the average relative RMSE of HT- 
LinEBLUP dominates that of HJ-LinMBD. Average 
coverage rates for HT-LinEBLUP are marginally better than 
those of HJ-LinMBD and HT-TrMBD, but the average 
widths of the confidence intervals underpinning these rates 
tended to be smallest for HT-TrMBD, followed by HT- 
LinEBLUP and then HJ-LinMBD. Overall, our model- 
based simulation results for Set B indicate that although 
MBD-based SAE with model-based model-calibrated 
weights is susceptible to model misspecification bias, the 
overall performance of this approach appears relatively 
unaffected by slight deviations from the assumed non-linear 
model. 

In Table 4 and Figure 2 we present, respectively, the average and
region-specific performance measures generated by the different
SAE methods for the AAGIS data. These results
show that the average relative bias of HT-TrMBD is smaller 
than that of both HT-LinEBLUP and HJ-LinMBD, while 
the average relative RMSE of HT-TrMBD is marginally 
larger than the corresponding values for HJ-LinMBD and 
HT-LinEBLUP. Inspection of Figure 2 shows that this result 
is essentially due to one region (21) in the original 
AAGIS sample that contained a massive outlier (TCC > 
A$30,000,000). This outlier was included in the simulation 
population (twice) and then selected (in one case, twice) in 
37 of the 1000 simulation samples, leading to completely 
unrealistic estimates for region 21 being generated by HT- 
TrMBD and HJ-LinMBD. The right-hand column in Table 4
therefore shows the average performances of the different
methods when this region is excluded. Here we see that now
HT-TrMBD and HJ-LinMBD are essentially on a par, with 
both dominating HT-LinEBLUP. The fact that HT-TrMBD 
does not provide significant gains over HJ-LinMBD in this 
case reflects the fact that the raw-scale and log-scale linear 
mixed models used in these estimators both provide 
relatively poor fits to the AAGIS data. 


Figure 2 Region-specific simulation results for HT-TrMBD (thick line), HT-LinEBLUP (thin line) and HJ-LinMBD (dashed line) in design-based simulations based on the AAGIS data. Plots show (in order from the top) RB (%), RRMSE (%) and CR, each plotted against Regions. Regions are ordered in terms of increasing population size.

7. Conclusions and further research


The simulation results discussed in the previous section 
show that combining model-based model-calibrated weights 
with direct estimation can bring significant gains in SAE 
efficiency if the population data are clearly non-linear. As 
one would expect, these gains are less when the assumed 
non-linear model is misspecified. Although we do not 
provide the details, our conclusions were essentially unaf- 
fected when we carried out similar simulations using 
gamma distributed random effects. 

Our main caveat concerning the use of the model-based 
model-calibrated weights (17) for SAE is their specificity. 
These weights do not appear to have the same ‘multi- 
purpose’ characteristics as standard EBLUP weights for the 
total based on linear mixed models. Further research is 
therefore required on how to build model-calibrated weights 
for SAE that are more ‘general purpose’. It is to be expected 
that such weights would not be as efficient as the variable 
specific weights (17), but hopefully this will be more than 
offset by their increased utility. A further issue that is 
extremely important in practice is that positively skewed 
survey variables can also take zero (or even negative) 
values. For example, economic variables like debt and 
capital expenditure often take zero values, while variables 
defined as the difference of two non-negative quantities 
(e.g., profit, which is the difference between income and 
expenditure) can be negative. Karlberg (2000b) uses a 
mixture model to characterise data that are a mix of zeros 
and strictly positive values. This type of model can be used 
in model-based model-calibrated weighting. 

Finally, we note that using a transformation-based MBD 
approach where the usual linear model assumptions are only 
approximately valid (the situation considered in this paper) 
is not the only approach that has been suggested for this 
problem. Two alternative approaches in the literature are the 
pseudo-EBLUP (Rao 2003, section 7.2.7) and the model- 
assisted EB-type estimator of Jiang and Lahiri (2006). 
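Recollect from (8) that the EBLUP is defined by replacing
the unknown area i mean \(m_{yi}\) by an estimate of its expected
value given the observed sample values of Y in area i and
the area i values of X. Let \(\pi_{ij}\) denote the sample inclusion
probability of population unit j in small area i. The pseudo-EBLUP
is then defined by replacing \(m_{yi}\) by an estimate of
its expected value given the value of its design-consistent
estimate

\[
\hat{m}_{yi}^{\pi} = \Big(\sum_{j \in s_i} \pi_{ij}^{-1}\Big)^{-1} \sum_{j \in s_i} \pi_{ij}^{-1} y_{ij} = \sum_{j \in s_i} w_{ij} y_{ij} \qquad (21)
\]

and the area i values of X. That is, under (3) the pseudo-EBLUP
of \(m_{yi}\) is

\[
\hat{m}_{yi}^{\text{pseudoEBLUP}} = \hat{E}\{m_{yi} \mid \hat{m}_{yi}^{\pi}, \bar{x}_i\}
= \bar{x}_i^{T} \hat{\beta}_w + (\bar{g}_i^{T} \hat{\Sigma}_{uw} \bar{g}_{iw}) \Big(\bar{g}_{iw}^{T} \hat{\Sigma}_{uw} \bar{g}_{iw} + \hat{\sigma}_{ew}^{2} \sum_{j \in s_i} w_{ij}^{2}\Big)^{-1} (\hat{m}_{yi}^{\pi} - \bar{x}_{iw}^{T} \hat{\beta}_w) \qquad (22)
\]

where \(\hat{\beta}_w\), \(\hat{\Sigma}_{uw}\) and \(\hat{\sigma}_{ew}^{2}\) are pseudo-maximum likelihood
estimates based on the weights \(w_{ij}\), and \(\bar{g}_{iw}\) and \(\bar{x}_{iw}\) are
design-consistent estimates of \(\bar{g}_i\) and \(\bar{x}_i\) that are defined in
exactly the same way as \(\hat{m}_{yi}^{\pi}\) above. Under the same model
the Jiang and Lahiri (2006) model-assisted EB-type approach
leads to an estimator that is also defined by conditioning
on the value of \(\hat{m}_{yi}^{\pi}\),

\[
\hat{m}_{yi}^{JL} = \sum_{j \in s_i} w_{ij} \hat{E}\{E(y_{ij} \mid x_{ij}, u_i) \mid \hat{m}_{yi}^{\pi}, \bar{x}_i\}
= \bar{x}_{iw}^{T} \hat{\beta} + \{w_{is}^{T} (g_{is} \hat{\Sigma}_u g_{is}^{T} + \hat{\sigma}_e^{2} I) w_{is}\}^{-1} \{w_{is}^{T} g_{is} \hat{\Sigma}_u g_{is}^{T} w_{is}\} (\hat{m}_{yi}^{\pi} - \bar{x}_{iw}^{T} \hat{\beta}) \qquad (23)
\]

where \(w_{is}\) is the vector of standardised sample weights \(w_{ij}\)
in area i. Note that in (23) we use optimal (i.e., ML or
REML) estimates for model parameters.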

Both (22) and (23) are essentially motivated by the idea
of estimating the area i mean by its conditional expectation
under (3) given the value of the usual design-consistent 
estimator (21) for this quantity. As such, they are indirect 
estimators like the HT-LinEBLUP. Under (3), neither will 
be as efficient as the HT-LinEBLUP, while if (9) rather than 
(3) holds, then both estimators rely on the design consistency
of the estimator (21) for robustness. Since relying on a large sample
property of a small sample statistic seems rather optimistic, 
we prefer to tackle the model specification problem directly, 
replacing (3) by (9) and using the transformation-based 
MBD approach described in section 5. Values of average 
relative bias and average relative RMSE for the pseudo- 
EBLUP (22) and the Jiang and Lahiri estimator (23) are 
shown in Table 4. It is interesting to note that neither 
estimator appears to perform any better than the standard 
EBLUP in these design-based simulations, and all three are 
substantially outperformed in terms of average relative
RMSE by the two MBD-type estimators that were investigated
in this study. Clearly the results of a single (but
reasonably realistic) simulation study should not be con- 
sidered as anything more than indicative. However, they do 
provide some evidence that asymptotic design-based proper- 
ties are no guarantee of small area estimation performance. 
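
As a small illustration of the design-consistent estimate on which both (22) and (23) condition, the following sketch computes a weighted area mean from inverse inclusion probabilities that are standardised to sum to one within the area. The function name hajek_mean and all the numbers are hypothetical; they are not taken from the AAGIS simulations.

```r
# Weighted mean of the sample values in one small area, using inverse
# inclusion probabilities standardised to sum to one (illustrative only).
hajek_mean <- function(y, incl_prob) {
  stopifnot(length(y) == length(incl_prob),
            all(incl_prob > 0), all(incl_prob <= 1))
  w <- (1 / incl_prob) / sum(1 / incl_prob)  # standardised weights
  sum(w * y)
}

y         <- c(12, 30, 7, 55)            # hypothetical sample values in area i
incl_prob <- c(0.10, 0.25, 0.10, 0.50)   # hypothetical inclusion probabilities
hajek_mean(y, incl_prob)                 # equals 420 / 26
hajek_mean(y, rep(0.2, 4))               # equal probabilities: the plain mean
```

With equal inclusion probabilities the weights are all 1/n and the estimate reduces to the unweighted sample mean, which is a convenient sanity check.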

The indirect predictor (20) of the small area mean is 
obtained by using well known prediction-based ideas. 
Under log transformed models, there are alternative approaches
to obtaining better indirect predictors for the small area
mean. For example, Slud and Maiti (2006) described an
indirect predictor for the small area mean under an area
level version of the log transformed model (9). Berg (2009,
private communication) follows the Slud-Maiti approach to
obtain a predictor for the small area mean under a random
intercepts specification of the unit level log transformed 
model (9). However, like the Slud-Maiti predictor, Berg’s 
predictor ignores the bias correction necessary after back- 
transformation to the raw scale. The empirical properties of 
this predictor have yet to be examined. 
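
The need for a bias correction after back-transformation can be illustrated with the standard lognormal identity: if log Y is normal with mean mu and variance sigma2, then E(Y) = exp(mu + sigma2/2), so exp(mu) alone understates the raw-scale mean. The sketch below shows only this textbook identity with made-up parameter values; it is not the predictor (20), nor the Slud-Maiti or Berg predictors.

```r
# Naive back-transformation versus the lognormal mean (illustrative values).
naive_backtransform <- function(mu) exp(mu)
lognormal_mean <- function(mu, sigma2) exp(mu + sigma2 / 2)

mu     <- 2      # hypothetical mean on the log scale
sigma2 <- 0.5    # hypothetical variance on the log scale

naive_backtransform(mu)
lognormal_mean(mu, sigma2)
# The ratio of the two is exp(sigma2 / 2), whatever the value of mu.
lognormal_mean(mu, sigma2) / naive_backtransform(mu)
```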


Acknowledgements 


The first author gratefully acknowledges the financial 
support provided by a PhD scholarship from the U.K. 
Commonwealth Scholarship Commission. Constructive 
comments from the Editor, the Associate Editor and two referees
are also gratefully acknowledged. They resulted in the 
revised version of the article representing a considerable 
improvement on the original. 


References 


Bates, D.M., and Pinheiro, J.-C. (1998). Computational Methods for 
Multilevel Models. http://franz.stat.wisc.edu/pub/NLME/. 


Carroll, R., and Ruppert, D. (1988). Transformation and Weighting in 
Regression. New York: Chapman and Hall. 


Chambers, R., and Tzavidis, N. (2006). M-quantile models for small 
area estimation. Biometrika, 93, 255-268. 


Chandra, H., and Chambers, R.L. (2005). Comparing EBLUP and C- 
EBLUP for small area estimation. Statistics in Transition, 7, 637- 
648. 


Chandra, H., and Chambers, R. (2009). Multipurpose weighting for
small area estimation. Journal of Official Statistics, 25(3), 379-395.


Chandra, H., Salvati, N. and Chambers, R. (2007). Small area
estimation for spatially correlated populations. A comparison of
direct and indirect model-based methods. Statistics in Transition,
8, 887-906.


Chen, G., and Chen, J. (1996). A transformation method for finite 
population sampling calibrated with empirical likelihood. Survey 
Methodology, 22, 139-146. 


Harville, D.A. (1977). Maximum likelihood approaches to variance 
component estimation and to related problems. Journal of the 
American Statistical Association, 72, 320-338. 


Hidiroglou, M.A., and Smith, P.A. (2005). Developing small area 
estimates for business surveys at the ONS. Statistics in Transition, 
7, 527-539. 


Jiang, J., and Lahiri, P. (2006). Estimation of finite population domain 
means: A model-assisted empirical best prediction approach. 
Journal of the American Statistical Association, 101, 301-311. 


Karlberg, F. (2000a). Population total prediction under a lognormal 
superpopulation model. Metron, LVIII, 53-80. 


Karlberg, F. (2000b). Survey estimation for highly skewed 
populations in the presence of zeroes. Journal of Official Statistics, 
16, 229-241. 


Longford, N.T. (2007). On standard errors of model-based small-area 
estimators. Survey Methodology, 33, 69-79. 


McCulloch, C.E., and Searle, S.R. (2001). Generalized, Linear and 
Mixed Models. New York: John Wiley & Sons, Inc. 


Prasad, N.G.N., and Rao, J.N.K. (1990). The estimation of the mean 
squared error of small area estimators. Journal of the American 
Statistical Association, 85, 163-171. 


Rao, J.N.K. (2003). Small Area Estimation. New York: John Wiley & 
Sons, Inc. 


Royall, R.M. (1976). The linear least-squares prediction approach to 
two-stage sampling. Journal of the American Statistical 
Association, 71, 657-664. 


Royall, R.M., and Cumberland, W.G. (1978). Variance estimation in 
finite population sampling. Journal of the American Statistical 
Association, 73, 351-358. 


Slud, E.V., and Maiti, T. (2006). Mean squared error estimation in
transformed Fay-Herriot models. Journal of the Royal Statistical
Society, Series B, 68(2), 239-257.


Tzavidis, N., Salvati, N., Pratesi, M. and Chambers, R. (2008). M- 
quantile models with application to poverty mapping. Statistical 
Methods And Applications, 17, 393-411. 


Valliant, R., Dorfman, A.H. and Royall, R.M. (2000). Finite 
Population Sampling and Inference. New York: John Wiley & 
Sons, Inc. 


Wu, C., and Sitter, R.R. (2001). A model calibration approach to
using complete auxiliary information from survey data. Journal of
the American Statistical Association, 96, 185-193.


Statistics Canada, Catalogue No. 12-001-X 



Survey Methodology, June 2011 
Vol. 37, No. 1, pp. 53-65 
Statistics Canada, Catalogue No. 12-001-X 



The construction of stratified designs in R with the package stratification 


Sophie Baillargeon and Louis-Paul Rivest ' 


Abstract 


This paper introduces an R package for the stratification of a survey population using a univariate stratification variable X and
for the calculation of stratum sample sizes. Non-iterative methods such as the cumulative root frequency method and the
geometric stratum boundaries are implemented. Optimal designs, with stratum boundaries that minimize either the CV of 
the simple expansion estimator for a fixed sample size n or the n value for a fixed CV can be constructed. Two iterative 
algorithms are available to find the optimal stratum boundaries. The design can feature a user defined certainty stratum 
where all the units are sampled. Take-all and take-none strata can be included in the stratified design as they might lead to 
smaller sample sizes. The sample size calculations are based on the anticipated moments of the survey variable Y, given the 
stratification variable X. The package handles conditional distributions of Y given X that are either a heteroscedastic linear 
model, or a log-linear model. Stratum specific non-response can be accounted for in the design construction and in the sample size calculations.


Key Words: Linear models; Log-linear models; Optimal stratification; Survey sampling; Take-all stratum; Take-none stratum.


1. Introduction 


The establishment of strata and the planning of a strati- 
fied design have been important topics in survey sampling, 
since the pioneering contributions of Dalenius more than 
sixty years ago. This work is concerned with univariate 
stratification where the strata are constructed using a posi- 
tive stratification variable X known for all the units of the 
population. X is assumed to be related to the survey variable
Y. Stratum h contains all the units with an X-value
in the interval \([b_{h-1}, b_h)\) for \(h = 1, \ldots, L\), such that \(b_0 = \min X\)
and \(b_L = \max X + 1\), where \(\min X\) and \(\max X\)
are respectively the minimum and the maximum values of
the stratification variable.
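
The stratum definition above amounts to locating each X-value among the boundaries. The following base R sketch shows one way to do this with findInterval; the helper name assign_strata and the data are hypothetical and are not part of the stratification package.

```r
# Assign each unit to the stratum h whose interval [b_{h-1}, b_h) contains x.
# bh holds the upper boundaries b_1 < ... < b_L, with b_L = max(x) + 1.
assign_strata <- function(x, bh) {
  stopifnot(!is.unsorted(bh), max(x) < bh[length(bh)])
  breaks <- c(min(x), bh)      # b_0 = min(x)
  findInterval(x, breaks)      # index h with breaks[h] <= x < breaks[h + 1]
}

x  <- c(3, 10, 10.2, 25, 97.5)   # hypothetical X-values
bh <- c(10.2, 29.6, 98.5)        # upper boundaries; 98.5 = max(x) + 1
stratum <- assign_strata(x, bh)
stratum                          # 1 1 2 2 3
table(stratum)                   # stratum sizes N_h
```

Because the intervals are closed on the left, a unit whose X-value equals a boundary (10.2 here) falls in the upper of the two strata.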

The determination of optimal stratum boundaries has a
long history, see chapter 5A of Cochran (1977). The cumulative
root frequency method (cum \(\sqrt{f}\)) of Dalenius and
Hodges (1959) provides an approximate solution to this
problem. Instances where X has a skewed distribution are 
frequent in business surveys and have been given a special 
emphasis. Gunning and Horgan (2004) proposed a geo- 
metric stratification method and Hidiroglou (1986) argued 
that the large units should be put in a take-all stratum. 
Rather than relying on an approximate method for con- 
structing the strata, Lavallée and Hidiroglou (1988) sug- 
gested an iterative algorithm that gives the optimal bound- 
aries for a particular X variable. Their algorithm sometimes 
fails to converge (Detlefsen and Veum 1991) and Slanta and 
Krenzke (1996) have shown that in some cases the optimal 
boundaries are not uniquely defined. Alternative methods, 
such as the search algorithm of Kozak (2004), have been 


proposed to alleviate some of these difficulties. The assump- 
tion that the survey variable Y is the same as the stratifi- 
cation variable X is not realistic when calculating sample 
sizes and several authors, including Dayal (1985) and 
Sigman and Monsour (1995), proposed to allocate the
sample to the strata on the basis of the anticipated moments
of Y knowing that X is in \([b_{h-1}, b_h)\). Sweet and Sigman
(1995) and Rivest (1999, 2002) suggested using these 
anticipated moments in the stratification algorithm of 
Lavallée and Hidiroglou (1988). Recently, Baillargeon and 
Rivest (2009) showed that putting the small units in a take- 
none stratum, which is not sampled, might reduce the sam- 
ple size needed to reach a predetermined precision level. 

This article introduces the R-package stratification that 
implements most of the methods presented above. It pro- 
vides a friendly computer environment to build stratified 
designs and to evaluate their performance on some real 
populations. This package is presented by revisiting exam- 
ples in the stratification literature selected to illustrate its 
important features. The four functions of stratification with 
the prefix strata construct stratified sampling designs. 
These functions are strata.cumrootf, strata.geo, 
strata.LH, and strata.bh. The first two implement the 
simple cum \(\sqrt{f}\) and geometric stratification methods. The
function strata.LH derives optimal stratified sampling 
plans using iterative algorithms while the last function 
handles user defined stratum boundaries. These four functions
construct strata, determine stratum sample sizes and
calculate the precision of the simple expansion estimator \(\bar{y}_s\)
of \(\bar{Y}\), the population mean of some survey variable Y
related to the stratification variable X.


1. Sophie Baillargeon, Département de mathématiques et de statistique, 1045, avenue de la médecine, Université Laval, Québec, (Qc) Canada G1V 0A6. 
E-mail: sophie.baillargeon@mat.ulaval.ca; Louis-Paul Rivest, Département de mathématiques et de statistique, 1045, avenue de la médecine, Université 
Laval, Québec, (Qc) Canada G1V 0A6. E-mail: louis-paul.rivest@mat.ulaval.ca.

The four strata-functions use Hidiroglou and Srinath's
(1993) rule to allocate the n units in the sample to the
strata. The stratum sample sizes are proportional to
\(N_h^{2q_1} \bar{Y}_h^{2q_2} S_{yh}^{2q_3}\), where \(N_h\) is the size of stratum h, and \(\bar{Y}_h\)
and \(S_{yh}^2\) are the anticipated mean and variance of Y in
stratum h. In the strata-functions, an allocation rule is
specified by the argument alloc that contains the exponents
\((q_1, q_2, q_3)\); Neyman's allocation corresponds to
alloc=c(1/2,0,1/2). A strata-function takes as an input
the population vector of the stratification variable X,
the number of strata Ls, and a total sample size n or a target
CV for the simple expansion estimator \(\bar{y}_s\). Its output is an
R-object of class strata that defines a stratified design. It
contains a set of strata determined by their upper boundaries
\(\{b_h\}\) and stratum population and sample sizes, \(N_h\) and \(n_h\).
There is a fifth function in stratification called var.strata
that takes as an input an R-object of class strata and a population
vector of a survey variable Y and returns the variance
of \(\bar{y}_s\) for the input variable Y and the input stratified design.
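
To make the allocation rule concrete, the sketch below distributes a total sample size n in proportion to the product of powers described above, with the three exponents passed as alloc. It is a base R illustration under the assumption that no rounding and no take-all adjustment is applied; the function name allocate and the stratum summaries are hypothetical, and this is not the package's own code.

```r
# General allocation: n_h proportional to N_h^(2*q1) * Ybar_h^(2*q2) * S_h^(2*q3),
# where varh = S_h^2, so S_h^(2*q3) is varh^q3. Returns unrounded sample sizes.
allocate <- function(n, Nh, meanh, varh, alloc = c(0.5, 0, 0.5)) {
  q <- alloc
  size <- Nh^(2 * q[1]) * meanh^(2 * q[2]) * varh^q[3]
  n * size / sum(size)
}

Nh    <- c(100, 50, 10)   # hypothetical stratum sizes
meanh <- c(10, 20, 50)    # hypothetical anticipated means
varh  <- c(4, 16, 100)    # hypothetical anticipated variances

allocate(50, Nh, meanh, varh, alloc = c(0.5, 0, 0.5))    # Neyman: 20 20 10
allocate(32, Nh, meanh, varh, alloc = c(0.5, 0, 0))      # proportional: 20 10 2
allocate(50, Nh, meanh, varh, alloc = c(0.35, 0.35, 0))  # power allocation, exponent 0.7
```

With alloc = c(0.5, 0, 0.5) the sizes are proportional to the stratum size times the stratum standard deviation, which is Neyman allocation; with alloc = c(0.35, 0.35, 0) they are proportional to the stratum total raised to the power 0.7.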

The text contains R instructions to be typed in an R com- 
mand window; these lines start with >. It also presents out- 
puts printed in an R command window. A special typeface 
allows an easy identification of these R instructions and 
print-outs in the text. The appendix contains a summary 
table that lists all the possible arguments of the five strati- 
fication functions. When using this package, the R-instruction
help(stratification) calls a clickable help file that
provides detailed information on the package and examples 
that can be pasted in a command window. 


2. Basic stratification methods 


This section discusses two elementary stratification meth- 
ods, the cumulative root frequency method of Dalenius and 
Hodges (1959) and the geometric method of Gunning and 
Horgan (2004). These two methods are exact; they do not 
rely on an iterative algorithm. Throughout this section
Y = X, so that the variance of \(\bar{y}_s\) is evaluated using the
values of the stratification variable X. Using the same
variable to stratify a population and to evaluate the precision
of survey estimates might underestimate their variances.
The calculation of variances when \(Y \neq X\) is considered
in Section 4.


2.1 Cumulative root frequency method 


This stratification algorithm, presented in chapter 5A
of Cochran (1977), is implemented by the function
strata.cumrootf. Its arguments are x, the population
vector of the stratification variable, nclass, the number of
bins of equal size for the x-variable, a target CV for \(\bar{y}_s\) or a
predetermined sample size n, the number of strata Ls, and
an allocation rule alloc. This algorithm pools the nclass
bins into Ls strata in such a way that the sums of the square
roots of the bin frequencies are approximately equal for the
Ls strata.
As an illustration, consider the proportion of industrial 
loans of N = 13,435 banks used in Cochran (1961). We 
stratify this population and evaluate the sample size needed 
for y, to have a CV of 5% when Neyman allocation is 
used. The following R-code creates the vector of the strati- 
fication variable loans from Table 2 of McEvoy (1956). 
The function strata.cumrootf is then applied to the 
loans variable. Following Table 2 of Cochran (1961), 
nclass is set to 20 so that the strata will be created using 20 
bins and Ls=3 strata will be constructed. The output is 
placed in cum, an R-object of class strata. Typing cum or
print(cum) in the R command window prints details of the
sampling plan. The input arguments, either the default or as
specified by the user, appear first. Then stratum information
is provided such as boundaries, sizes \(N_h\), and sample sizes
\(n_h\). The third part of the print-out provides information
about the sampling properties of \(\bar{y}_s\).
> values <- c(seq(0.5, 9.5, 1), seq(12.5, 97.5, 5))
> nrep <- c(1985, 261, 339, 405, 474, 478, 506, 569, 464, 499,
    2157, 1581, 1142, 746, 512, 376, 265, 207, 126, 107,
    82, 50, 39, 25, 16, 19, 2, 3)
> loans <- rep(values, nrep)
> cum <- strata.cumrootf(x = loans, nclass = 20, CV = 0.05,
    Ls = 3, alloc = c(0.5, 0, 0.5))
> cum

Given arguments:
x = loans
nclass = 20, CV = 0.05, Ls = 3
allocation: q1 = 0.5, q2 = 0, q3 = 0.5
model = none

Strata information:
          rh |   bh anticip.Mean anticip.var    Nh nh   fh
Stratum 1  1 | 10.2         4.12       10.46  5980 14 0.00
Stratum 2  1 | 29.6        17.92       27.14  5626 20 0.00
Stratum 3  1 | 98.5        44.47      165.83  1829 16 0.01
Total                                        13435 50 0.00

Total sample size: 50
Anticipated population mean: 15.39408
Anticipated CV: 0.0494897


In the Given arguments, model=none means that the
sampling properties of \(\bar{y}_s\), presented at the end of the print-out,
are evaluated at Y = X, that is for the loans variable.
Its mean is 15.39408 and the anticipated CV of 0.0494897 is
that of the estimator \(\bar{y}_s\) of the mean of the variable loans
obtained with this sampling design. The stratum boundaries
given in this output are (10.2, 29.6, 98.5); they are equal to
those appearing at the bottom of page 349 of Cochran
(1961), once the rounding used for creating the vector
loans is accounted for. In the Strata information, rh
refers to the stratum response rates that are discussed in
Section 5.1. The R-object cum contains several elements
that are listed by the command names(cum).


> names(cum)
 [1] "Nh"            "nh"        "n"         "nh.nonint" "certain.info"
 [6] "opti.criteria" "bh"        "meanh"     "varh"      "mean"
[11] "stderr"        "CV"        "stratumID" "nclassh"   "takeall"
[16] "call"          "date"      "args"

An element in the cum strata object can be printed by
typing cum$ followed by the name of the element. For
instance, cum$stratumID prints the stratum of each unit
in the population. The variable cum$nclassh is specific to
the strata.cumrootf function; it gives how the
nclass=20 original bins have been pooled into three strata:


> cum$nclassh
[1] 2 4 14

Thus, in this stratification, strata 1, 2 and 3 contain 
respectively 2, 4 and 14 of the nclass=20 original bins. 
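
The pooling of bins can be sketched in base R as follows. The sketch rebuilds the loans vector from the frequencies of the example (they sum to N = 13,435), forms nclass equal-width bins, and cuts the cumulated square roots of the bin frequencies at the bins closest to equal shares. The "closest bin" rule and the function name cum_root_f are assumptions made for this illustration; the package may resolve the approximation differently.

```r
# Loans data of the example: 28 distinct values with their frequencies.
values <- c(seq(0.5, 9.5, 1), seq(12.5, 97.5, 5))
nrep <- c(1985, 261, 339, 405, 474, 478, 506, 569, 464, 499,
          2157, 1581, 1142, 746, 512, 376, 265, 207, 126, 107,
          82, 50, 39, 25, 16, 19, 2, 3)
loans <- rep(values, nrep)

# Cumulative root frequency sketch (assumed rule: cut at the bin whose
# cumulated sqrt(frequency) is closest to each equal share).
cum_root_f <- function(x, nclass, Ls) {
  width <- (max(x) - min(x)) / nclass
  bin <- pmin(floor((x - min(x)) / width) + 1, nclass)  # maximum goes in last bin
  freq <- tabulate(bin, nbins = nclass)
  csf <- cumsum(sqrt(freq))
  targets <- (1:(Ls - 1)) * csf[nclass] / Ls
  cut_bin <- sapply(targets, function(t) which.min(abs(csf - t)))
  list(nclassh = diff(c(0, cut_bin, nclass)),  # bins pooled into each stratum
       bh = min(x) + cut_bin * width)          # interior stratum boundaries
}

res <- cum_root_f(loans, nclass = 20, Ls = 3)
res$nclassh   # 2 4 14
res$bh        # 10.2 29.6
```

Under these assumptions the sketch gives the same pooling (2, 4 and 14 bins) and the same interior boundaries (10.2 and 29.6) as the output discussed above.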


2.2 Geometric method 


The geometric stratification method was introduced
by Gunning and Horgan (2004). It sets the stratum boundaries
to \(b_h = \min X \times (\max X / \min X)^{h/L}\), for \(h = 1, \ldots, L - 1\).
Once the boundaries \(b_h\) are determined, the stratum
sample size calculations are the same as those carried out in
strata.cumrootf.
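
The geometric boundaries depend on the data only through the minimum and the maximum of X, as the following base R sketch shows. The function name geo_bounds and the data vector are hypothetical; the range 10 to 198 was chosen because it appears consistent with the UScities boundaries 59.98 and 108.98 reported in Table 1, which is an assumption about that data set rather than something stated in the text.

```r
# Geometric stratum boundaries b_h = min(x) * (max(x) / min(x))^(h / L),
# for h = 1, ..., L - 1. Requires a strictly positive variable.
geo_bounds <- function(x, Ls) {
  a <- min(x)
  b <- max(x)
  stopifnot(a > 0, Ls >= 2)
  a * (b / a)^((1:(Ls - 1)) / Ls)
}

x <- c(10, 25, 40, 198)          # hypothetical values; only the range matters
bh <- geo_bounds(x, Ls = 5)
round(bh, 2)                     # 18.17 33.01 59.98 108.98
bh[-1] / bh[-length(bh)]         # constant ratio between successive boundaries
```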

As an illustration we stratify the four populations 
presented in Gunning and Horgan (2004), Debtors, USbanks,
UScities, and UScolleges, into Ls=5 strata. The last three 
populations were considered in Cochran’s (1961) investi- 
gations. These four populations are stored in stratification; 
the command data(Debtors) calls the first one. Rather 
than specifying a target CV we set the sample size to n = 
100 following Gunning and Horgan (2004). The following 
commands create the R-object pop1 that contains the 
stratified design for the Debtors population. 


> data(Debtors)
> pop1 <- strata.geo(x = Debtors, n = 100, Ls = 5,
    alloc = c(0.5, 0, 0.5))


Table 1 summarizes the geometric stratified designs for 
the four study populations. It reproduces Table 4 of Gunning 
and Horgan (2004) partially. There are however some minor 
differences caused by different rounding strategies. More 
details about stratification rounding methods are available 
in the help file. 


Table 1
Stratified designs for four populations with n = 100

Population   CV              1        2         3         4     5
Debtors      0.0359  b_h   148.28   549.67  2,037.60  7,553.33
                     N_h    1,054    1,267       732       265    51
                     n_h        3       14        27        33    23
UScities     0.0145  b_h    18.17    33.01     59.98    108.98
                     N_h      364      418       130        sh  Sh9)
                     n_h       18       28        17      DOME     H
UScolleges   0.0183  b_h   434.00   941.76  2,043.61  4,434.60
                     N_h       94      255       198        74    56
                     n_h        3       15        27        20    35
USbanks      0.0107  b_h   118.59   200.92    340.39    576.68
                     N_h      114      116        64        39    24
                     n_h       13       20        25        18    24


2.3. Take-all stratum 


In Table 1, the fifth stratum for the USbanks population
is a take-all stratum since \(n_5 = N_5 = 24\). Under Neyman
allocation, the fifth stratum gets a sample size \(n_5\) larger than
the stratum size \(N_5\). Then strata.geo automatically
identifies this stratum as a take-all stratum and allocates the
\(n - N_5\) units for the first four strata using Neyman allocation.
This adjustment is important to have a sample size of
n = 100 as specified in the strata.geo arguments.

To illustrate this point, we use the function strata.bh to 
make an allocation without a take-all stratum adjustment. 
This function allocates the sample and calculates the
precision of \(\bar{y}_s\) for a predetermined set of stratum boundaries.
By setting takeall.adjust=FALSE, Neyman
allocation is used in the five strata and since \(n_5 > N_5\) one
has \(n_5 = N_5\). The following R-code gets the geometric
stratum boundaries \(\{b_h\}\) in the strata object adjust; it then
uses the strata.bh function with the geometric stratum
boundaries to get the sampling design without adjusting for
a take-all stratum five in the noadjust strata object.

> data(USbanks)
> adjust <- strata.geo(x = USbanks, n = 100, Ls = 5,
    alloc = c(0.5, 0, 0.5))
> noadjust <- strata.bh(x = USbanks, bh = adjust$bh,
    n = 100, Ls = 5, alloc = c(0.5, 0, 0.5), takeall = 0,
    takeall.adjust = FALSE)


The two designs are presented in Table 2. Failing to
include a take-all stratum yields a sample size of n = 99,
smaller than the target n = 100. In this case, the unrounded
sample size for stratum 5 is noadjust$nh.nonint[5] =
25.40 for \(N_5 = 24\) units. Note that when n is large or
when the target CV is small, it is possible to get several
take-all strata.
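
The take-all adjustment can be sketched in base R as an iteration: allocate by Neyman's rule, turn any stratum whose allocation exceeds its size into a take-all stratum, and reallocate the remaining units among the other strata. This is an illustrative reading of the adjustment with hypothetical numbers and a hypothetical function name; rounding to integers is left out, and the package's own implementation may differ.

```r
# Neyman allocation with an iterative take-all adjustment (unrounded sizes).
neyman_takeall <- function(n, Nh, Sh) {
  takeall <- rep(FALSE, length(Nh))
  repeat {
    nh <- ifelse(takeall, Nh, 0)         # take-all strata are fully sampled
    free <- !takeall
    n_free <- n - sum(Nh[takeall])       # units left for the other strata
    nh[free] <- n_free * Nh[free] * Sh[free] / sum(Nh[free] * Sh[free])
    over <- free & nh > Nh
    if (!any(over)) break
    takeall <- takeall | over
  }
  list(nh = nh, takeall = takeall)
}

Nh <- c(100, 50, 10)   # hypothetical stratum sizes
Sh <- c(2, 4, 40)      # hypothetical stratum standard deviations

# Without adjustment: the plain Neyman sizes 10, 10, 20 are capped at N_h,
# so only 30 of the 40 requested units are actually allocated.
plain <- 40 * Nh * Sh / sum(Nh * Sh)
sum(pmin(plain, Nh))

# With adjustment: stratum 3 becomes take-all and the total is preserved.
res <- neyman_takeall(40, Nh, Sh)
res$nh        # 15 15 10
res$takeall   # FALSE FALSE TRUE
```

The shortfall in the unadjusted design is the same phenomenon as the n = 99 obtained for the USbanks population when takeall.adjust = FALSE.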


Table 2
Stratified designs obtained with and without an automatic
adjustment for a take-all stratum

             n            1        2        3        4     5
                  b_h   118.59   200.92   340.39   576.68
                  N_h     114      116       64       39    24
adjust     100    n_h      13       20       25       18    24
noadjust    99    n_h      13       20       24       18    24


2.4 Adding a take-all stratum 


We now consider the data base on N = 284 Swedish
municipalities given in the appendix of Särndal, Swensson
and Wretman (1992). The following instructions use the geometric
method to stratify this population in Ls=5 strata using
the variable REV84, the 1984 real estate values. The power
allocation with exponent 0.7, that is alloc=c(0.35,0.35,0),
is used. The R-object geo of class strata contains the
stratified design. The command plot(geo) produces the
plot presented in Figure 1. It provides a histogram of the
stratification variable with the stratum boundaries and a
summary table for the stratified design.
> data(Sweden)
> geo <- strata.geo(x = Sweden$REV84, CV = 0.05, Ls = 5,
    alloc = c(0.35, 0.35, 0))


[Plot titled "Graphical Representation of the Stratified Design geo": histogram of the stratification variable X (horizontal axis, 0 to 60,000) against Density (vertical axis), with strata 1 to 5 marked]

Figure 1 Plot of the R-object geo


Figure 1 shows that the geometric stratification method
puts two of the three extreme REV84 values in a take-all
stratum. The following R-code creates cum, a stratified
design for this population using the cum \(\sqrt{f}\) method. The
application of this stratification method is awkward since
the bins have length {max(REV84) - min(REV84)}/50 =
1191. Considering Figure 1, most of the bins have a null
frequency; indeed stratum 5 comprises 43 of the 50 bins.
This design does not have a take-all stratum. To calculate
the sample sizes obtained by requesting a take-all stratum
one can use the function strata.bh, with the cum \(\sqrt{f}\)
boundaries stored in cum$bh, with the argument
takeall=1. This gives the third sampling plan in Table 3.
The fourth sampling plan of Table 3, cum3, is created by
setting the sample size in stratum 5 of the cum \(\sqrt{f}\)
design equal to its population size with the command
cum3$nh[5]<-cum3$Nh[5]. The variance of the estimate
\(\bar{y}_s\) for the variable REV84 using this fourth sampling design
is calculated using var.strata.


> cum <- strata.cumrootf(x = Sweden$REV84, nclass = 50,
    CV = 0.05, Ls = 5, alloc = c(0.35, 0.35, 0))
> cum2 <- strata.bh(x = Sweden$REV84, bh = cum$bh, CV = 0.05,
    Ls = 5, takeall = 1, alloc = c(0.35, 0.35, 0))
> cum3 <- cum
> cum3$nh[5] <- cum3$Nh[5]
> cum3.var <- var.strata(cum3, y = Sweden$REV84)

Table 3
Paper designs for the population of Swedish municipalities 

Method 1 Vivid methenen 5. uin demOYs 

geometric Np Le ey OE ee ee 
Np 3 Tit NO Bet De eB eBags 

cum /f Np, 120i 70 cet 22 Teel 
Np 7 Fine Ge eS eo lOn et eens, 
hah 2 2ie3% Decls: JAeaes 
ade 7 7 9 oo8 15 46 223 


Table 3 highlights that the sampling fraction in the fifth
stratum drives the value of n. The cum √f design appears
to be less efficient than the geometric design since its
sampling fraction in stratum 5 is 10/15 = 67%. Requesting a
take-all stratum gives a value of n comparable to that
obtained with the geometric design. The REV84 population
has three outliers that were identified in Table 1. The
geometric and cum √f stratification methods depend
heavily on the maximum X-value; therefore before applying
these techniques it might be wise to put the three
outliers aside. This is considered in the next section.

The simple ad hoc method of arbitrarily changing the
stratum sample sizes presented in this section can be applied
in several situations. For instance, when some strata have
samples of size 1, they can be increased to 2 in order to have
an unbiased variance estimator.


2.5 Certainty stratum 


In a stratified design it might be useful to constrain some
units to be sampled, before constructing the strata. The
argument certain available in the four strata functions
makes this possible. As an example we revisit the comparison
of the cum √f and the geometric sampling designs
presented in Table 3. The three large municipalities
highlighted in Figure 1 are put in a certainty stratum, and the
N = 281 remaining municipalities are stratified into Ls=4
strata using the two stratification methods. The R-code for
constructing these two designs is given below. The command
x=sort(Sweden$REV84) orders the municipalities by
increasing REV84; thus the three large municipalities are
entries 282, 283 and 284 of the sorted vector. The two R
objects of class strata, geo_cer and cum_cer, each contain
an element certain.info that provides information on the
certainty stratum.


> geo_cer <- strata.geo(x = sort(Sweden$REV84), CV = 0.05,
    Ls = 4, alloc = c(0.35, 0.35, 0), certain = 282:284)
> cum_cer <- strata.cumrootf(x = sort(Sweden$REV84),
    nclass = 50, CV = 0.05, Ls = 4, alloc = c(0.35, 0.35, 0),
    certain = 282:284)
> cum_cer$certain.info


      Nc    meanc
    3.00 38923.67


Survey Methodology, June 2011 


In Table 4, the cum √f design is more efficient than the
geometric design. Putting the three large municipalities in a
certainty stratum is helpful since the sample sizes in Table 4
are smaller than those of Table 3. The argument certain
can force any set of units into the sample. It can be used to
include units that are extreme for a secondary variable,
different from the stratification variable, or that have a
history of high volatility.


Table 4 
Two stratified designs for the Swedish municipalities constructed 
with a certainty stratum 


Method 1 Bie led iehSin i Sa- CV 
geometric Np 42 UGS Sas 

Np 2 See eee goes A een Tl 
cum Jf N, 127 79°" 46 929 "3 

Mp 3 AP aS sora 


3. Optimization method 


The stratification methods introduced in Section 2 do not
always give an optimal stratified design, that is, one that
minimizes the sample size n needed to reach the target CV
(or minimizes the CV for a fixed n). This section introduces
the function strata.LH that allows the determination of optimal
designs. The name LH stands for Lavallée and Hidiroglou
(1988) who pioneered the construction of optimal stratified
designs for real life survey populations. In a stratified design
with a take-all stratum, the variance of the simple expansion
estimator is given by


$$\mathrm{Var}(\bar{y}_s) = \sum_{h=1}^{L-1} \left(\frac{N_h}{N}\right)^2 \left\{\frac{1}{(n - N_L)\,a_h} - \frac{1}{N_h}\right\} S_{yh}^2,$$


where $\{a_h\}$ is the allocation rule for setting stratum sample
sizes. The n that ensures a CV of c is given by


$$n = N_L + \frac{\sum_{h=1}^{L-1} N_h^2 S_{yh}^2 / (a_h N^2)}{c^2 \bar{Y}^2 + \sum_{h=1}^{L-1} N_h S_{yh}^2 / N^2}. \qquad (1)$$


In this expression one can write $n = n(b_1, \ldots, b_{L-1})$ to
highlight that the value of n depends on the stratum boundaries.
The strata.LH function tries to find the optimal
boundaries $b_h$ that minimize $n(b_1, \ldots, b_{L-1})$. Two
minimization algorithms are available, either Sethi’s (1963)
algorithm as implemented by Lavallée and Hidiroglou (1988)
with algo="Sethi" or Kozak’s (2004) random search
algorithm with algo="Kozak". The latter is the default
option. This section assumes Y = X; it does not distinguish
the stratification variable from the survey variable.
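
As an illustration of formula (1), the following base-R sketch evaluates n for a given set of boundaries when Y = X. It is not the package code: the function name n_for_cv is hypothetical, the allocation is assumed to be of the power form with a_h proportional to N_h^(2 q1) * mean_h^(2 q2) * S_h^(2 q3), and the result is left real-valued (no rounding of stratum sample sizes).

```r
# Sample size from formula (1) for given boundaries; real-valued, no rounding.
# q = c(q1, q2, q3): exponents of the assumed power allocation.
n_for_cv <- function(x, bh, cv, q = c(0.5, 0, 0.5), takeall = FALSE) {
  N <- length(x)
  L <- length(bh) + 1
  h <- factor(findInterval(x, bh) + 1, levels = seq_len(L))  # stratum [b_{h-1}, b_h)
  Nh <- as.vector(table(h))
  mh <- as.vector(tapply(x, h, mean))
  S2h <- as.vector(tapply(x, h, var))
  ts <- if (takeall) seq_len(L - 1) else seq_len(L)          # take-some strata
  NL <- if (takeall) Nh[L] else 0                            # take-all stratum size
  w <- Nh[ts]^(2 * q[1]) * mh[ts]^(2 * q[2]) * S2h[ts]^q[3]
  ah <- w / sum(w)                                           # allocation rule a_h
  num <- sum(Nh[ts]^2 * S2h[ts] / ah) / N^2
  den <- cv^2 * mean(x)^2 + sum(Nh[ts] * S2h[ts]) / N^2
  NL + num / den
}

set.seed(2)
x <- rlnorm(5000, meanlog = 5, sdlog = 1)                    # hypothetical population
bh <- quantile(x, c(0.5, 0.8, 0.95, 0.995), names = FALSE)   # arbitrary boundaries
n_neyman <- n_for_cv(x, bh, cv = 0.05)                       # q = (0.5, 0, 0.5)
n_takeall <- n_for_cv(x, bh, cv = 0.05, takeall = TRUE)
c(n_neyman, n_takeall)
```

With q = (0.5, 0, 0.5) the allocation is proportional to N_h S_h, so the value without a take-all stratum reduces to the usual Neyman sample size.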


3.1 Sethi (1963) example with the normal distribution


A classical problem is to determine the optimal boundaries
for L strata in an infinite population from a known
distribution. For instance, Sethi (1963) derived the optimal
bounds for the normal and the $\chi^2_{30}$ distributions. To obtain
approximate solutions, one can run the strata.LH function
on a large Monte Carlo population simulated from the
known distribution, without requesting a take-all stratum. In
(1), one has $N_h/N^2 \approx 0$ and the optimal boundaries are the
same for any target CV c.

The following R-code simulates populations of size $10^5$
from the N(10, 1) and the $\chi^2_{30}$ distributions. Observe that
stratification requires the stratification variable to be
nonnegative, so that it would not work on standard normal
deviates. By subtracting 10 from the N(10, 1) boundaries,
we get the ones for a N(0, 1). The calculations are
done with the strata.LH function with the argument
algo="Sethi" and with takeall=0, so that a take-all
stratum is not requested.


> z <- rnorm(100000, 10)
> z15 <- strata.LH(x = z, CV = 0.001, Ls = 5,
    alloc = c(0.5, 0, 0.5), takeall = 0, algo = "Sethi")
> z15$bh - 10


[1] -1.1247340 -0.3480829 0.3297044 1.0979017


> x30 <- rchisq(100000, 30)
> x15 <- strata.LH(x = x30, CV = 0.01, Ls = 5,
    alloc = c(0.5, 0, 0.5), takeall = 0, algo = "Sethi")
> x15$bh


[1] 22.82148 28.12303 33.38642 40.20165


In Table 5, the agreement between the true bounds 
reported in Table 8 of Sethi (1963) and the Monte Carlo 
bounds is quite good. This approach could be used to 
calculate the optimal stratum boundaries for an arbitrary 
distribution, see for instance Khan, Nand, and Ahmad 
(2008). 


Table 5
Comparison of Sethi’s (1963) optimal stratum boundaries and
of the approximate boundaries obtained with stratification


                    stratification’s results             | Sethi’s results
               L        1        2        3        4     |     1      2      3      4
b_h, N(0,1)    2     -0.007                              |   0.00
               3     [illegible] [illegible]             |  -0.55   0.55
               4     -0.883   -0.008    0.864            |  -0.88   0.00   0.88
               5     -1.125   -0.348    0.330    1.098   |  -1.11  -0.34   0.34   1.11
b_h, χ²(30)    2     30.674                              |  30.6
               3     26.535   35.141                     |  26.0   35.0
               4     24.340   30.733   38.179            |  24.0   30.6   38.0
               5     22.821   28.123   33.386   40.202   |  22.0   28.0   33.0   40.0


3.2 Gunning and Horgan (2004) example 


In their original proposal, Lavallée and Hidiroglou
(1988) always had a take-all stratum for a skewed survey
variable. To show that this was not always mandatory,
Gunning and Horgan (2004) derived the optimal stratified
designs featuring a take-all stratum for the four populations
considered in Table 1. The findings of their Table 7 (with
slight corrections due to rounding errors) are reproduced in
Table 6. Comparing Tables 1 and 6, one sees that the optimal
designs featuring a take-all stratum have n-values
larger than 100 for three populations out of four. The optimal
design is superior to the geometric design only for the
Debtors population. The R-code to run Sethi’s algorithm on
the Debtors population is given below.

> pop1LH <- strata.LH(x = Debtors, CV = 0.0359, Ls = 5,
    alloc = c(0.5, 0, 0.5), takeall = 1, algo = "Sethi")

In Table 6, one would expect the optimal designs
obtained through an iterative algorithm to have a smaller
sample size than the ad hoc geometric designs. This fails to
occur for three populations. This might be caused by a
failure of Sethi’s algorithm to find the true minimum value
for n. To check this, we reran the programs to produce Table
6 with the argument algo="Kozak". The sample sizes n
are given in the second column of Table 7. Kozak’s
algorithm finds a smaller n-value than Sethi’s for three of
the four populations. This highlights the weakness of Sethi’s
algorithm for real populations. The second column of Table
7 has n-values larger than 100 for two of the four populations.
In these cases, the geometric design might be better
because a take-all stratum is not required. To check this we
reran Kozak’s algorithm without a take-all stratum, i.e., with
takeall=0. The results are reported in the third column of
Table 7. For the Debtors and the UScolleges populations,
taking away the take-all stratum reduces the sample size n.
Still, for the UScities population, Kozak’s algorithm does
worse than the geometric design. It failed to find the true
minimum value of n with the default arguments that
control its random search. To better understand the results
of Table 7, we now present in more detail the
selection of initial stratum boundaries in strata.LH and
the parameters that control the random search with
algo="Kozak".

Table 6
Optimal stratified designs featuring a take-all stratum obtained
with Sethi’s algorithm for the 4 populations of Table 1


Population    n     CV              1            2            3            4           5
Debtors       93    0.0359   b_h    349.33       1,190.16     3,482.98     10,322.50
                             N_h    1,856        991          350          146         26
                             n_h    13           17           17           20          26
UScities      137   0.0145   b_h    [illegible]  [illegible]  [illegible]  80.47
                             N_h    189          270          336          164         79
                             n_h    4            8            16           30          79
UScolleges    107   0.0183   b_h    512.32       869.76       1,577.23     3,668.85
                             N_h    133          180          185          110         69
                             n_h    4            6            10           18          69
USbanks       104   0.0107   b_h    99.37        129.60       181.94       317.36
                             N_h    70           66           82           65          74
                             n_h    4            4            7            15          74


Table 7
Sample size n for three optimal designs and four populations


Population    algo=Sethi    algo=Kozak    algo=Kozak
              takeall=1     takeall=1     takeall=0
Debtors       93            92            82
UScities      137           114           123
UScolleges    107           107           95
USbanks       104           88            88


3.3 Customization of the algorithms


The default initial stratum boundaries for the two
iterative algorithms are the arithmetic starting point of
Gunning and Horgan (2007), with $b_h = \min X + (\max X - \min X) \times h/L$
for $h = 1, \ldots, L - 1$. In Table 7, this
choice is questionable and the geometric stratum boundaries
would have been closer to the true optimal boundaries.
In strata.LH, the argument initbh specifies a
vector of L − 1 initial boundary values. The maximum
number of iterations can be changed with the maxiter
element of the algo.control argument.
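
The two kinds of starting boundaries can be compared with a few lines of base R. This is only a sketch: the helper names arithmetic_bh and geometric_bh and the toy vector are hypothetical, and the geometric rule is written in its usual form b_h = min X × (max X / min X)^(h/L), which requires min X > 0.

```r
# Arithmetic starting points: equally spaced between min(x) and max(x).
arithmetic_bh <- function(x, Ls) {
  min(x) + (max(x) - min(x)) * seq_len(Ls - 1) / Ls
}
# Geometric boundaries: equally spaced on the log scale (needs min(x) > 0).
geometric_bh <- function(x, Ls) {
  min(x) * (max(x) / min(x))^(seq_len(Ls - 1) / Ls)
}

x <- c(10, 12, 15, 20, 30, 50, 90, 160, 400, 1000)   # toy skewed values
arithmetic_bh(x, 5)    # 208 406 604 802
geometric_bh(x, 5)
```

For a skewed variable the arithmetic points leave most units in the first stratum, whereas the geometric points spread them more evenly, which is one reason to pass geometric boundaries through initbh.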

Kozak’s algorithm was first proposed in Kozak (2004);
see also Kozak and Verma (2006). It uses a random
search that selects the L − 1 stratum boundaries among
the sorted values of X, with the duplicates discarded. At
each iteration, it randomly picks a number d in the set
{-maxstep, -maxstep+1, ..., maxstep} and one of the
L − 1 boundaries. Then it moves the selected boundary by
d positions in the vector of sorted X-values. If (1) is
smaller with the new boundary it is kept; otherwise it is
discarded and the boundaries are left unchanged at this
iteration. The algorithm stops when the boundaries have not
been changed for maxstill consecutive iterations. The
default values are maxstep=3 and maxstill=100. Two
consecutive runs of Kozak’s algorithm might lead to
different designs because of the random nature of this
algorithm. The strata.LH function runs the algorithm rep times
and the information for each run is contained in the
rep.detail element of R-objects of class strata; the
default value is rep=3. If the rep runs lead to different
designs, then the tuning parameters of the algorithm can be
changed. One can also use rep="change", which runs
the algorithm 27 times with different starting and maxstep
values. An additional example illustrating an instance where
Kozak’s algorithm does not reach a global minimum is
presented in the Appendix.
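
The random search just described can be sketched in base R as follows. This is a simplified illustration, not the strata.LH implementation: the names crit_n and kozak_sketch are hypothetical, the criterion is n from (1) under Neyman allocation with Y = X and no take-all stratum, and the only admissibility rule applied is that every stratum holds at least two units.

```r
# Criterion: n from (1), Neyman allocation, no take-all stratum (assumptions).
crit_n <- function(x, bh, cv) {
  N <- length(x)
  h <- factor(findInterval(x, bh) + 1, levels = seq_len(length(bh) + 1))
  Nh <- as.vector(table(h))
  if (any(Nh < 2)) return(Inf)                 # inadmissible boundaries
  Sh <- as.vector(tapply(x, h, sd))
  sum(Nh * Sh)^2 / (N^2 * cv^2 * mean(x)^2 + sum(Nh * Sh^2))
}

kozak_sketch <- function(x, init_pos, cv, maxstep = 3, maxstill = 100) {
  ux <- sort(unique(x))                        # sorted X-values, duplicates discarded
  pos <- init_pos                              # positions of the L - 1 boundaries in ux
  best <- crit_n(x, ux[pos], cv)
  still <- 0
  while (still < maxstill) {
    j <- sample.int(length(pos), 1)            # pick one boundary at random
    d <- sample(-maxstep:maxstep, 1)           # pick a step in {-maxstep, ..., maxstep}
    new <- pos
    new[j] <- new[j] + d
    ok <- all(new >= 2) && all(new <= length(ux)) && all(diff(new) > 0)
    val <- if (ok) crit_n(x, ux[new], cv) else Inf
    if (val < best) {                          # keep the move only if n decreases
      pos <- new
      best <- val
      still <- 0
    } else {
      still <- still + 1
    }
  }
  list(bh = ux[pos], n = best)
}

set.seed(3)
x <- rlnorm(1000, meanlog = 5, sdlog = 1)      # hypothetical population
ux <- sort(unique(x))
init <- round(length(ux) * (1:4) / 5)          # arbitrary starting positions
start_n <- crit_n(x, ux[init], cv = 0.05)
res <- kozak_sketch(x, init, cv = 0.05)
c(start = start_n, final = res$n)
```

Because moves are limited to maxstep positions, a run can stall in a local minimum; rerunning with another seed or a larger maxstep mimics the role of the rep and maxstep tuning parameters.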

With $N_u$ unique X-values, there are approximately
$\binom{N_u - 1}{L - 1}$ possible sets of stratum boundaries. If this number is
smaller than minsol all the possible sets of strata are tried,
rather than carrying out a random search. The default value
is minsol=1000. The elements maxstep, maxstill,
minsol and rep belong to the algo.control argument. In
Table 7, we were unable to improve the geometric stratified
design for the UScities population. The command to run
Kozak’s algorithm 27 times with various tuning parameters
is given below.

> data(UScities)
> pop2LHrep <- strata.LH(x = UScities, CV = 0.0145, Ls = 5,
    alloc = c(0.5, 0, 0.5), takeall = 0, algo = "Kozak",
    algo.control = list(rep = "change"))

This command takes a few seconds to run and yields a
stratified design with n = 100, similar to that presented in
Table 1 for the UScities.


3.4 Designs with a predetermined sample size n 


With Kozak’s algorithm it is possible to find the
boundaries that minimize the CV of $\bar{y}_s$ for a fixed sample
size n rather than minimizing n for a predetermined CV.
As an example we revisit the stratified designs of Table 1.
The geometric boundaries are used as initial values and the
default Kozak algorithm is run. The R-code for the Debtors
population is given below.

> pop1k <- strata.LH(x = Debtors, initbh = pop1$bh, n = 100,
    Ls = 5, alloc = c(0.5, 0, 0.5), algo = "Kozak")

The CVs of the estimator $\bar{y}_s$ obtained with the
optimal stratified designs are 3.12%, 1.43%, 1.72%, and
1.04% for the four populations as compared with 3.59%,
1.45%, 1.83%, and 1.07% in Table 1. Thus the iterative
algorithm reduced the CVs.


4. Stratification with anticipated moments 


A difference between the stratification variable X and
the survey variable Y can be accounted for by having a
model for the conditional distribution of Y given X. In
stratification, there is a log-linear model where


$$Y = \exp(\alpha)\, X^{\beta} \exp(\sigma\varepsilon),$$

and a heteroscedastic linear model with

$$Y = \alpha + \beta X + \sigma\varepsilon X^{\gamma/2}, \qquad (2)$$


where α, β, and γ are real parameters specified by the user
and ε is a N(0, 1) random variable. A random replacement
model (Rivest 1999) is also available and stratum
specific mortality rates (Baillargeon, Rivest and Ferland
2007) can be added to the log-linear model.

Under these models, the anticipated mean of Y for the
units classified in stratum h, with $X \in [b_{h-1}, b_h)$, is


$$\bar{Y}_h = \frac{1}{N_h} \sum_{b_{h-1} \le X_i < b_h} E(Y \mid X_i),$$


while the anticipated variance is


$$S_{yh}^2 = \frac{1}{N_h} \sum_{b_{h-1} \le X_i < b_h} \mathrm{Var}(Y \mid X_i) + \frac{1}{N_h} \sum_{b_{h-1} \le X_i < b_h} \left\{E(Y \mid X_i) - \overline{E(Y \mid X)}_h\right\}^2,$$


where $\overline{E(Y \mid X)}_h$ is the average of the predicted values
of Y for the units in stratum h. In strata.cumrootf,
strata.geo and strata.bh these expressions are used to
evaluate the sampling properties of $\bar{y}_s$, while in strata.LH,
the minimization of (1) is carried out with anticipated
moments. In strata.LH the stratum boundaries depend on
the model for the relationship between X and Y; they do
not for the other strata functions.
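
For the log-linear model the conditional moments have closed forms, E(Y | X) = exp(α + σ²/2) X^β and Var(Y | X) = exp(2α + σ²){exp(σ²) − 1} X^(2β), so the anticipated moments can be computed directly. The following base-R sketch does this; the function name anticipated_moments, the explicit alpha argument and the toy data are hypothetical, and the code is not taken from the package.

```r
# Anticipated stratum means and variances of Y under the log-linear model
# Y = exp(alpha) * X^beta * exp(sigma * eps), eps ~ N(0, 1).
anticipated_moments <- function(x, bh, alpha, beta, sig2) {
  h <- findInterval(x, bh) + 1                         # stratum of each unit
  ey <- exp(alpha + sig2 / 2) * x^beta                 # E(Y | X_i)
  vy <- exp(2 * alpha + sig2) * (exp(sig2) - 1) * x^(2 * beta)  # Var(Y | X_i)
  mean_h <- tapply(ey, h, mean)
  # average conditional variance + spread of the predicted values in the stratum
  var_h <- tapply(vy + (ey - ave(ey, h))^2, h, mean)
  data.frame(stratum = as.integer(names(mean_h)),
             mean = as.vector(mean_h), var = as.vector(var_h))
}

x <- c(20, 35, 48, 60, 120, 180, 250, 400, 900)        # toy X-values
bh <- c(50, 200)                                       # two boundaries, three strata
am <- anticipated_moments(x, bh, alpha = 0, beta = 1.058355, sig2 = 0.25677^2)
am
```

Setting sig2 = 0, alpha = 0 and beta = 1 recovers the stratum means and (divisor N_h) variances of X itself, i.e. the case Y = X.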


4.1 An example with the MU284 Swedish municipalities


In Section 2.5 two stratified sampling plans were derived
for the MU284 population with REV84 as stratification
variable. The R-code that follows investigates the performance
of these sampling designs for the variable RMT85.
The vector ord contains the position of the order statistics of
the REV84 variable; thus Y[ord] is the vector of the
RMT85 variable, ordered by increasing REV84 value.

> data(Sweden)
> X <- Sweden$REV84
> Y <- Sweden$RMT85
> ord <- order(X)
> geo_rmt <- var.strata(geo_cer, y = Y[ord])
> cum_rmt <- var.strata(cum_cer, y = Y[ord])
> c(geo_rmt$RRMSE, cum_rmt$RRMSE)


[1] 0.06889558 0.07368794


In Section 2.5, the CVs of the estimator $\bar{y}_s$ for the
stratification variable REV84 were less than 5% for the
cum √f and the geometric designs. When estimating the
mean of RMT85, the CVs are larger than 6%. This
emphasizes that calculating sample sizes with a stratification
variable underestimates the n needed to reach the target CV
for a different survey variable. These results are reported in
the first two designs of Table 8. Table 8 also shows the
optimal design calculated by applying Kozak’s algorithm to
the REV84 variable, assuming Y = X.

Following Rivest (2002), a log-linear model is fitted for
the relationship between the two variables. As shown in
Figure 2, there are outliers and the following R-code estimates
the parameters of the log-linear model by discarding
the municipalities with extreme values of the ratio X/Y
(below its 3% quantile or above its 97% quantile). The 18
discarded municipalities are represented by a star in Figure
2. The R-code for fitting the model to the non-outliers
follows.

> keep <- (X/Y > quantile(X/Y, 0.03)) & (X/Y < quantile(X/Y, 0.97))
> reg <- lm(log(Y)[keep] ~ log(X)[keep])
> coef(reg)


 (Intercept) log(X)[keep]
 [illegible]     1.058355


> summary(reg)$sigma


[1] 0.25677


Figure 2 Plot of RMT85 by REV84 from the data set Sweden


The following code stratifies the MU284 population on
REV84 using the cum √f and the geometric method. The
allocation is however carried out with anticipated moments
calculated with the log-linear regression model of RMT85
on REV84. The strata of these two designs are the same as
those calculated earlier. The model affects only the
anticipated CV. It is not so for the optimal design where the
anticipated moments are used in the stratification algorithm.
Kozak’s algorithm might fail to find the global minimum
n-value when using anticipated moments; thus we use the
bounds calculated with Y = X as starting values.


> geo_cer.m <- strata.geo(x = X[ord], CV = 0.05, Ls = 4,
    alloc = c(0.35, 0.35, 0), model = "loglinear",
    certain = (length(X) - 2):length(X), model.control =
    list(beta = 1.058355, sig2 = 0.25677^2))

> geo_cer.var <- var.strata(geo_cer.m, y = Y[ord])

> cum_cer.m <- strata.cumrootf(x = X[ord], nclass = 50,
    CV = 0.05, Ls = 4, alloc = c(0.35, 0.35, 0),
    certain = (length(X) - 2):length(X), model = "loglinear",
    model.control = list(beta = 1.058355, sig2 = 0.25677^2))

> cum_cer.var <- var.strata(cum_cer.m, y = Y[ord])

> LH <- strata.LH(x = X, CV = 0.05, Ls = 5,
    alloc = c(0.35, 0.35, 0), takeall = 1)

> LH.var <- var.strata(LH, y = Y)

> LH_m <- strata.LH(x = X, CV = 0.05, Ls = 5,
    initbh = LH$bh, alloc = c(0.35, 0.35, 0), takeall = 1,
    model = "loglinear", model.control = list(beta = 1.058355,
    sig2 = 0.25677^2))

> LH_m.var <- var.strata(LH_m, y = Y)


In Table 8, sample sizes calculated with anticipated
moments give CVs smaller than 5% for estimating the mean
of the RMT85 variable. The optimal LH design requires an n
slightly smaller than the other two. Accounting for Y ≠ X
when minimizing (1) gives a larger take-all stratum since its
size increased from 4 to 5 when using the anticipated
moments.

Finally observe that the arguments model and
model.control can be used with var.strata. For the
geometric design considered in this section, one can get
results very similar to those obtained with the argument
y=Y. As shown below, the model yields a CV of 6.894% as
compared with 6.890% obtained with the original RMT85
variable. For the cum √f method the model CV is 7.282%
as compared to 7.369% found earlier, while for the
Lavallée-Hidiroglou algorithm these two values are 7.080% and
7.110%.


> geo_rmt2 <- var.strata(geo_cer, model = "loglinear",
    model.control = list(beta = 1.058355, sig2 = 0.25677^2))
> geo_rmt2$RRMSE


[1] 0.0689368


Table 8
Three stratified designs for estimating the mean RMT85 with
REV84 as the stratification variable


Model Method 1 7 3 4 5 sn anticip. 
CV 

Y =X ten Ne TTS, @46.05 29/73 
Np 3 AeA pee Sio 3) wl D. aeuiag 

geometric Ny), AW) ANNoy“tehe} 35) 3 
Np 2 5-6 Fg OS. eS 24 we89 

LH Ny 120" 82045 en 
Np 3 AvnecAnely St. Jie 20m 

loglineary cum./f Nyse 127 79. A6n5, 29.5.3 
Np 6 8. O10, a a SO Mena 

geometric Ny, 42 116 88 35 3 
Np 3 8 13. 1S os Aue are 

LH N, 12% 81 45°32" 3 
Np 6 TF G25. B41 7490 


4.2 Anderson, Kish and Cornell (1976) example with 
the bivariate normal distribution 


Anderson et al. (1976) investigated the optimal stratification
for Y based on X when (X, Y) has a bivariate
normal distribution with correlation ρ. Thus model (2)
holds with α = γ = 0, β = ρ, and σ² = 1 − ρ², where X
has a N(0, 1) distribution. To reproduce Anderson et al.
(1976) results, we generate a population of size $N = 10^5$
from a N(0, 1) distribution and select model="linear"
(as in Section 3.1 a mean of 10 was used to prevent X from
being negative). For a linear model, only Kozak’s algorithm
works. Given the special nature of the problem, the
maxstep parameter is set to 20 and only one repetition
(rep=1) of the algorithm is run. When there is no take-all
stratum, the optimal stratum boundaries are independent of
the CV, as in Section 3.1. We used CV = 0.01 in the
calculations.
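
The parameter values quoted for model (2) can be checked by simulation. The sketch below, which uses only base R and hypothetical object names, generates a standard bivariate normal pair with correlation ρ = 0.25 and verifies that the regression of Y on X has slope close to ρ and residual variance close to 1 − ρ².

```r
# Bivariate normal (X, Y) with correlation rho, written as model (2)
# with alpha = gamma = 0, beta = rho and sigma^2 = 1 - rho^2.
set.seed(4)
rho <- 0.25
N <- 1e5
x <- rnorm(N)
y <- rho * x + sqrt(1 - rho^2) * rnorm(N)

fit <- lm(y ~ x)
c(slope = unname(coef(fit)[2]),          # close to rho
  resid_var = summary(fit)$sigma^2,      # close to 1 - rho^2
  target_var = 1 - rho^2)
```

These are the values passed as beta and sig2 in model.control, with gamma = 0 for a homoscedastic error.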


> x <- rnorm(1e+05, 10)

> bi3a <- strata.LH(x = x, CV = 0.01, Ls = 3, takenone = 0,
    model = "linear",
    model.control = list(beta = 0.25, sig2 = 1 - 0.25^2,
    gamma = 0), algo.control = list(maxstep = 20, rep = 1))

> bi3a$bh - 10


[1] -0.619354 0.604198


In Table 9, stratification’s results are equal to Anderson
et al.’s (1976) findings up to nearly two decimals. This
highlights the flexible nature of the package; it can find the
optimal stratified design for any distribution of the
stratification variable and for some general models for the
conditional distribution of Y given X.


Table 9
Comparison of Anderson et al. (1976) optimal stratum boundaries
with the approximate boundaries obtained with stratification


               stratification’s results               | Anderson et al.’s results
L    ρ          1        2        3        4          |     1      2      3      4
3    0.250   -0.619    0.604                          |  -0.61   0.61
     0.950   -0.591    0.568                          |  -0.58   0.58
     0.990   -0.571    0.549                          |  -0.56   0.56
4    0.250   -0.984    0.004    0.985                 |  -0.98   0.00   0.98
     0.950   -0.930    0.009    0.942                 |  -0.93   0.00   0.93
     0.990   -0.902   -0.001    0.895                 |  -0.90   0.00   0.90
5    0.250   -1.245   -0.377    0.387    1.251        |  -1.24  -0.38   0.38   1.24
     0.950   -1.187   -0.358    0.372    [illegible]  |  [illegible]
     0.990   -1.136   -0.344    0.353    1.144        |  -1.14  -0.35   0.35   1.14


5. Additional features 


Baillargeon and Rivest (2009) considered additional
aspects of a stratified design, namely stratum specific
anticipated non-response rates and the addition of a
take-none stratum with a null sample size. This section discusses
briefly how these additional items are handled in
stratification. Non-response needs to be accounted for when
optimizing for n. A take-none stratum makes $\bar{y}_s$ biased; in this
case the precision target is specified in terms of a Relative
Root Mean Squared Error (RRMSE) rather than a CV.
Formula (4.3) of Baillargeon and Rivest (2009) provides a
generalization of (1) that includes these two features. This is
the formula used for calculating sample sizes in the
optimization procedure.


5.1 Non-response 


Non-response can be corrected a posteriori, by dividing
the stratum sample sizes calculated without non-response by
the response rates. This is illustrated in the following R-code that
considers the MRTS variable, representative of the Statistics
Canada Monthly Retail Trade Survey. Post hoc non-response
corrections are implemented in the var.strata
function with the argument rh.postcorr=TRUE. An
alternative is to consider response rates when allocating the
sample to the strata. They can be specified in a strata
function with the argument rh. This approach penalizes
strata with a high non-response; it typically yields a smaller
n-value than the a posteriori corrections. This is illustrated
in the cum √f portion of Table 10. With four strata and
response rates of 0.8, 0.8, 0.9, 1, the a posteriori correction
needs n = 445 to reach the target CV for the MRTS
variable, as compared with n = 444 for an allocation that
takes non-response into account.

> data(MRTS)
> cum <- strata.cumrootf(x = MRTS, nclass = 500, CV = 0.01,
    Ls = 4, alloc = c(0.5, 0, 0.5))
> cum.var <- var.strata(cum, rh = c(0.8, 0.8, 0.9, 1))
> cum.post <- var.strata(cum, rh = c(0.8, 0.8, 0.9, 1),
    rh.postcorr = TRUE)


> cum_rh <- strata.cumrootf(x = MRTS, nclass = 500, CV = 0.01,
    Ls = 4, alloc = c(0.5, 0, 0.5), rh = c(0.8, 0.8, 0.9, 1))
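
The a posteriori correction amounts to a one-line calculation, sketched here in base R. The function name postcorr_nh and the stratum sizes are hypothetical, and both the rounding-up rule and the cap at the stratum population size are assumptions made for this illustration rather than documented behaviour of var.strata.

```r
# A posteriori correction: inflate each stratum sample size by 1 / response rate.
postcorr_nh <- function(nh, Nh, rh) {
  nh_corr <- ceiling(nh / rh - 1e-9)   # round up (assumed rule); 1e-9 guards against
                                       # floating-point noise on exact quotients
  pmin(nh_corr, Nh)                    # assumed cap: no more units than the stratum holds
}

nh <- c(50, 60, 70, 40)                # hypothetical sample sizes without non-response
Nh <- c(800, 700, 350, 40)             # hypothetical stratum population sizes
rh <- c(0.8, 0.8, 0.9, 1)              # anticipated response rates
postcorr_nh(nh, Nh, rh)                # 63 75 78 40
```

Strata with full response, such as the last one here, are left unchanged, while low-response strata absorb the whole increase in n.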


Non-response can also be accounted for when constructing
an optimal sampling design, either a posteriori or in the
stratum construction. These two approaches are implemented
for the MRTS population in the following R-code.
The higher non-response rates for the small units penalize
the first stratum, which is smaller when non-response is
accounted for in the stratification algorithm, as can be seen
in Table 10. Still, accounting for non-response in the stratum
construction gives a smaller n-value than an a posteriori
correction. Table 3 of Baillargeon and Rivest (2009) presents
additional examples, including both anticipated moments
and non-response, of the construction of stratified
designs for the MRTS population.


> LH <- strata.LH(x = MRTS, CV = 0.01, Ls = 4,
    alloc = c(0.5, 0, 0.5), takeall = 1)
> LH.var <- var.strata(LH, rh = c(0.8, 0.8, 0.9, 1))
> LH.post <- var.strata(LH, rh = c(0.8, 0.8, 0.9, 1),
    rh.postcorr = TRUE)
> LH_rh <- strata.LH(x = MRTS, CV = 0.01, Ls = 4,
    alloc = c(0.5, 0, 0.5), takeall = 1,
    rh = c(0.8, 0.8, 0.9, 1))


Table 10
Two examples of non-response correction: either a posteriori (post)
or when constructing the design


Method rh 1 2 3 4 n  anticip. 


CV 


cum ff mone oN, 778 742 355 125 
ey Sie B90! 9 88) 195° 390) AR tea 
Ae MOF 113 "98 125" 4451.00 


nm, 105 108 106 125 444 ~~ 1.00 
LH none N, 
Bye fi SOs G0 1778 379 111 
OS GR Age 100 
given N, 675 677 449 199 
Tie in 10.6% 80 199 418 1.00 


5.2 Take-none stratum 


A take-none stratum with a null sample size might be
advantageous when the population has small units with
Y-values close to 0. The precision of $\bar{y}_s$ is then measured by
the mean squared error, $\mathrm{Var}(\bar{y}_s) + (T_{Y0}/N)^2$, where $T_{Y0}$ is

Statistics Canada, Catalogue No. 12-001-X 


62 Baillargeon and Rivest: The construction of stratified designs in R with the package stratification 


the anticipated Y-total in the take-none stratum. Setting
takenone=1 in the strata.LH function constructs an
optimal design with a take-none stratum. Baillargeon and
Rivest (2009) showed that Sethi’s algorithm does not work
in this case and that Kozak’s algorithm should be used.
When a take-none stratum is used, a rough bias correction
can be implemented by dividing $\bar{y}_s$ by the proportion of the
total of the X variable in the take-some strata. Thus the bias
penalty in the mean squared error might be too stringent and
an alternative measure of precision, such as
$\mathrm{Var}(\bar{y}_s) + (p \times T_{Y0}/N)^2$, could be used in the stratification
algorithm, where p is a number in (0, 1). This smaller bias
penalty can be implemented by setting the argument
bias.penalty equal to p. The following R-code constructs
three optimal stratified designs for the MRTS population,
with and without a take-none stratum; the default full
bias penalty is compared to a reduced penalty with p = 0.5.


> data(MRTS)
> notn <- strata.LH(x = MRTS, CV = 0.1, Ls = 3,
    alloc = c(0.5, 0, 0.5))
> tn1 <- strata.LH(x = MRTS, CV = 0.1, Ls = 3,
    alloc = c(0.5, 0, 0.5), takenone = 1)
> tn0.5 <- strata.LH(x = MRTS, CV = 0.1, Ls = 3,
    alloc = c(0.5, 0, 0.5), takenone = 1, bias.penalty = 0.5)
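
The effect of the bias penalty on the precision measure can be seen with a small base-R sketch. The function name rrmse_takenone and all numeric inputs are hypothetical, and the RRMSE is taken here to be the root of the penalized mean squared error divided by the population mean, which is an assumption about the normalization.

```r
# Relative root mean squared error with a bias penalty p in [0, 1]:
# sqrt(Var + (p * T0 / N)^2) / Ybar, where T0 is the Y-total in the take-none stratum.
rrmse_takenone <- function(var_ybar, t0, N, ybar, p = 1) {
  sqrt(var_ybar + (p * t0 / N)^2) / ybar
}

var_ybar <- 4      # hypothetical variance of the estimator
t0 <- 3000         # hypothetical Y-total of the take-none stratum
N <- 2000          # hypothetical population size
ybar <- 50         # hypothetical population mean

sapply(c(full = 1, half = 0.5, none = 0), function(p)
  rrmse_takenone(var_ybar, t0, N, ybar, p))
```

With p = 0 the measure reduces to the usual CV, and lowering p makes a larger take-none stratum acceptable for the same target.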


The sample sizes n for the three designs are given in
Table 11. Including a take-none stratum with a full bias
penalty reduces n from 22 to 16; for this design the
take-none stratum accounts for 3% of the total of the X-variable.
Reducing the bias penalty to p = 0.5 increases the size of
the take-none stratum and reduces n. Additional illustrations
are given in Table 2 of Baillargeon and Rivest
(2009). They show that the size of a take-none stratum
typically decreases with the target RRMSE. For the MRTS
example, the addition of a take-none stratum diminishes the
n-value substantially while for other populations it does not
change the design.


Table 11
Sample sizes for three optimal stratified designs for the MRTS
population


takenone                                  0      1      1
bias.penalty                              NA     1      0.5
n                                         22     16     3
% of X total in the take-none stratum     0      3      9


6. Conclusion 


The R-package stratification offers flexible methods for
the construction of a stratified sampling design using a
univariate stratification variable such as a measure of size in
a business survey. Several methods are available to determine
the stratum boundaries and the stratum sample sizes.
stratification allows the investigation of features such as a
take-all stratum, a take-none stratum, the extent of the
discrepancy between X and Y, and stratum specific
non-response.


7. Appendix


7.1 More details on Kozak’s algorithm


As described in Section 3.3, Kozak’s algorithm uses a
random search. Besides decreasing the optimization criterion,
either the n-value or the RRMSE of $\bar{y}_s$, stratification
requires that the take-some strata contain at least minNh
units and that they have positive sample sizes, for the
new boundary to be admissible. The default is minNh=2.
A non-random version of Kozak’s algorithm is also available with
method="modified" in the algo.control argument. It
tries all the possible changes at one iteration and picks the
one that gives the largest drop of the optimization criterion.
It is slower than Kozak’s algorithm without improving the
detection of the global minimum of the optimization
criterion. Therefore, it will not be discussed any further.

To illustrate the complete enumeration of all possible
solutions mentioned in Section 3.3, consider the USbanks
data set. It contains 357 values, but only 200 unique values.
If one wishes to stratify this population in two strata, only
$\binom{200-1}{2-1} = 199$ solutions are possible. The following command
performs a complete enumeration of the possible solutions:


> enum <- strata.LH(x = USbanks, CV = 0.05, Ls = 2,
    alloc = c(0.5, 0, 0.5))


These solutions, with their associated optimization crite- 
ria value, can be found in enum$sol.detail. Only the 
solutions fulfilling the admissibility constraints mentioned 
above are included in enum$sol.detail. 
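
For two strata the complete enumeration is easy to reproduce in base R, as the following sketch shows. It does not use the USbanks data or the package: the function name enumerate_two_strata and the simulated values with ties are hypothetical, the criterion is n from (1) under Neyman allocation with no take-all stratum, and inadmissible boundaries (a stratum with fewer than minNh units) are flagged with NA rather than dropped.

```r
# Try every possible boundary between two strata and record the resulting n.
enumerate_two_strata <- function(x, cv, minNh = 2) {
  ux <- sort(unique(x))
  cand <- ux[-1]                               # N_u - 1 candidate boundaries
  N <- length(x)
  n_val <- sapply(cand, function(b) {
    h <- factor(x >= b, levels = c(FALSE, TRUE))
    Nh <- as.vector(table(h))
    if (any(Nh < minNh)) return(NA_real_)      # inadmissible solution
    Sh <- as.vector(tapply(x, h, sd))
    sum(Nh * Sh)^2 / (N^2 * cv^2 * mean(x)^2 + sum(Nh * Sh^2))
  })
  data.frame(b1 = cand, n = n_val)
}

set.seed(5)
x <- round(rlnorm(357, meanlog = 5, sdlog = 0.8))   # hypothetical data with ties
sol <- enumerate_two_strata(x, cv = 0.05)
nrow(sol) == choose(length(unique(x)) - 1, 2 - 1)   # one row per possible boundary
sol[which.min(sol$n), ]                             # best admissible boundary
```

The number of rows equals the binomial coefficient given above, which is why enumeration is only practical when the number of unique values and of strata are both small.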

When running Kozak’s algorithm, the initial boundary
values might fail to meet the admissibility constraints; the
algorithm might not be able to move at all. In such a case,
the initial boundaries are replaced by robust ones. The
robust boundaries give an empty take-none stratum if such a
stratum is requested, take-all strata as small as possible, and
take-some strata with approximately the same number of
unique X-values.

Consider once again the example of Section 3.2 with the
UScities data set, where Kozak’s algorithm reached a
local minimum with the default arguments. With geometric
initial boundaries, Kozak’s algorithm converges rapidly to
what appears to be a global minimum.


> LH_init <- strata.LH(x = UScities, initbh = pop2$bh,
    n = 100, Ls = 5, alloc = c(0.5, 0, 0.5), takeall = 0,
    algo.control = list(rep = 1))

> LH_init$iter.detail


bl b2 b3 b4 opti Step iter run 
ii Mono Jo. OOS OF 0.01444981 0 0 ih 
Z AO Meo OMe ALOT, 0.01435576 2 2 1 
3 Bo SS. OS) 107 0.01434272 =i 10 1 
4 OO SO OSL ON LOT, 0.01432714 =i We 1 
5 HG oSHILSY SESS (ON salty 0.01431013 =2 IES) il 
6 ONO MS 2 oe Oe Ole HOW, 0.01430163 1 63 il 


> LH_init$niter
[1] 163


The output element LH_init$iter.detail contains
information about the initial boundaries and the 5 iterations
with a change of boundaries only. A total of 163 iterations
were needed for the algorithm to converge. The geometric
initial boundaries are very close to the optimal solutions. A
local minimum can also be avoided by changing some of
the algorithm’s parameters. The following R-code allows
larger steps (maxstep=20) and increases the maximal number
of iterations (maxstill=1000) and the number of
repetitions of the algorithm (rep=20).

> LH_param <- strata.LH(x = UScities, n = 100, Ls = 5,
    alloc = c(0.5, 0, 0.5), takeall = 0, algo.control =
    list(maxstep = 20, maxstill = 1000, rep = 20))


The results for the 20 repetitions are reported in
LH_param$rep.detail and summarized in Table 12. The
solution obtained with the geometric initial boundaries is
reached 9 times out of 20.


Table 12 
Solutions found by Kozak’s algorithm for 20 repetitions 
CV B1 B2 B3 B4 frequency
0.0143 19.50 32.50 58.00 107.00 9 
0.0167 16.50 23.50 37.50 78.00 5 
0.0167 15.50 22.50 35.50 73.00 6 


Figure 3 shows how larger steps help the algorithm to 
reach the global minimum (CV = 0.0143), compared to a 
run of the algorithm with the default arguments (dotted 
lines, CV = 0.0167). 


7.2 R package stratification summary table 


This appendix provides a quick reference for the R 
package stratification. Table 13 lists the five functions in 
stratification and their arguments. The following notes 
complete the table. 


(1) According to the general allocation scheme (Hidiroglou
and Srinath 1993), the stratum sample sizes are proportional
to $N_h^{2q_1} \bar{Y}_h^{2q_2} S_{yh}^{2q_3}$.
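The allocation rule in note (1) can be illustrated with a short numerical sketch. It is written in Python rather than R and is not the package's implementation; it assumes the general allocation scheme takes stratum sample sizes proportional to $N_h^{2q_1} \bar{Y}_h^{2q_2} S_{yh}^{2q_3}$, so that $q_1 = q_3 = 0.5$, $q_2 = 0$ gives Neyman allocation. The function name and the two-stratum figures are hypothetical.

```python
def general_allocation(n, N_h, ybar_h, s_h, q=(0.5, 0.0, 0.5)):
    """Split a total sample size n across strata.

    Stratum h receives a share proportional to
    N_h**(2*q1) * ybar_h**(2*q2) * s_h**(2*q3) (assumed form of the scheme).
    The default q corresponds to Neyman allocation (share ~ N_h * s_h).
    """
    q1, q2, q3 = q
    gamma = [
        (N ** (2 * q1)) * (ybar ** (2 * q2)) * (s ** (2 * q3))
        for N, ybar, s in zip(N_h, ybar_h, s_h)
    ]
    total = sum(gamma)
    # Fractional sizes are returned; rounding is left to the caller.
    return [n * g / total for g in gamma]


# Hypothetical two-stratum population, for illustration only.
neyman = general_allocation(30, [100, 50], [10.0, 20.0], [1.0, 4.0])
proportional = general_allocation(30, [100, 50], [10.0, 20.0], [1.0, 4.0],
                                  q=(0.5, 0.0, 0.0))
```

With these figures, Neyman allocation favours the small but variable second stratum, while $q = (0.5, 0, 0)$ reduces to allocation proportional to stratum size.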


(2) The default value of initbh is the set of arithmetic 
starting points of Gunning and Horgan (2007), see Section 
3.3. If takenone=1 and initbh is of size Ls-1, the initial 
boundary of the take-none stratum is set to the first percentile
of x. If this first percentile is equal to the minimum
value of x, this initial boundary would lead to an empty
take-none stratum. In that case, the initial boundary of the
take-none stratum is instead set to the second smallest value
of x.


Figure 3 Iteration histories for two runs of Kozak's algorithm (boundaries against iteration; dotted lines: default arguments; other lines: maxstep=20, maxstill=1000)
(3) The elements to specify in the algo.control argument
depend on the algorithm. The following table shows the 
elements used by each algorithm and their default values. 
See help(strata.LH) for a complete description of every 
element. 


Algorithm maxiter method minNh maxstep maxstill rep minsol 


Sethi 500 = = = = = = 
Original Kozak | 10,000 “original” a 3 100 3 1,000 
Modified Kozak} 3,000 ‘modified’? 2 3 - 1,000 


(4) The elements of the model.control argument depend
on the model:
- loglinear model with mortality:

  Y = exp(alpha + beta log(X) + epsilon) with probability p_h,
  Y = 0 with probability 1 - p_h,

  where epsilon ~ N(0, sig2) is independent of X.
  The parameter p_h is specified through ph, ptakenone
  and pcertain.
- heteroscedastic linear model:

  Y = beta X + epsilon,

  where epsilon ~ N(0, sig2 X^gamma).
- random replacement model:

  Y = X with probability 1 - epsilon,
  Y = Xnew with probability epsilon,

  where Xnew is a random variable independent of X with
  the same distribution as X.

The following table presents model.control default values
according to the model.

"loglinear"  1 0
"linear"     1 0 - - - 0 -
"random"


Table 13
R package stratification summary table

argument        description                                        format                                          default
x               stratification variable                            vector                                          none (x is mandatory)
n               target total sample size                           scalar                                          none (n or CV is mandatory)
CV              target CV or RRMSE                                 scalar                                          none (n or CV is mandatory)
Ls              number of sampled strata                           scalar                                          3
alloc           allocation specification (1)                       list(q1, q2, q3) where qi >= 0                  Neyman (q1=q3=0.5, q2=0)
certain         x-indices for units sampled with certainty         vector                                          NULL (no certainty stratum)
nclass          number of bins                                     scalar                                          min(10L, N)
bh              strata boundaries                                  vector                                          none (bh is mandatory)
takeall.adjust  indicator of adjustment for take-all strata        TRUE or FALSE                                   FALSE (no adjustment)
takeall         number of take-all strata                          one of {0, 1, ..., Ls-1}                        0
initbh          initial strata boundaries (2)                      vector                                          equidistant boundaries
algo            algorithm identification                           "Kozak" or "Sethi"                              "Kozak"
algo.control    algorithm's parameters specification (3)           list(maxiter, method, minNh, maxstep,           depends on algo
                                                                   maxstill, rep)
strata          stratification scheme                              strata object                                   none (strata is mandatory)
y               study variable                                     vector                                          NULL (model given instead)
model           model identification                               "none", "loglinear", "linear"* or "random"*     "none"
                                                                   (*unavailable with Sethi's algo)
model.control   model's parameter specification (4)                list(beta, sig2, ph, ptakenone, gamma,          depends on model, but equivalent
                                                                   epsilon)                                        to model="none"
rh              anticipated response rates                         scalar or vector                                rep(1,Ls) or rh from strata
rh.postcorr     indicator of posterior correction for              TRUE or FALSE                                   FALSE (no correction)
                non-response
takenone        number of take-none strata                         0 or 1
bias.penalty    penalty for the bias                               scalar
Replication variance estimation under two-phase sampling


Jae Kwang Kim and Cindy Long Yu 1


Abstract 


In two-phase sampling for stratification, the second-phase sample is selected by a stratified sample based on the information 
observed in the first-phase sample. We develop a replication-based bias adjusted variance estimator that extends the method 
of Kim, Navarro and Fuller (2006). The proposed method is also applicable when the first-phase sampling rate is not 
negligible and when second-phase sample selection is unequal probability Poisson sampling within each stratum. The 
proposed method can be extended to variance estimation for two-phase regression estimators. Results from a limited 


simulation study are presented. 


Key Words: Double sampling; Jackknife; Regression estimator; Reweighted expansion estimator. 


1. Introduction 


Two-phase sampling, first introduced by Neyman (1938) 
and sometimes called double sampling, is a cost effective 
technique in survey sampling. It is typically used when it is 
very expensive to collect data on the variables of interest, but 
it is relatively inexpensive to collect data on variables that are 
correlated with the variables of interest. Two-phase sampling 
has applications in different forms (e.g., Rao 1973; Cochran
1977; Breidt and Fuller 1993; Rao and Sitter 1995;
Hidiroglou and Särndal 1998; Fuller 1998; Hidiroglou 2001;
Fuller 2003). Two-phase sampling for stratification refers to
the situation where the observations from the first-phase
sample are used to form strata for the second-phase
sampling. Because the first-phase sample is selected for
stratification purposes, two-phase sampling is a useful tool when
no sampling frame suitable for stratification is available at the
beginning. For example, in forest surveys, it is very difficult 
and expensive to travel to remote areas to make on-ground 
determinations. However, aerial photographs are relatively 
inexpensive, and determinations on, say, forest type from 
aerial photos are strongly correlated with ground deter- 
minations and can be used to stratify the first phase sample. 

Replication variance estimation is very popular in 
complex surveys. Rust and Rao (1996) and Wolter (2007) 
provide comprehensive overviews on this topic. The repli- 
cation method does not require the computation of the 
partial derivative of the Taylor expansion and the user can 
easily produce variance estimates without knowing the 
sampling design that was used to collect the data. Furthermore,
the use of replication methods is increasing because of confidentiality
issues (Lu and Sitter 2006). Once the replication weights are
provided, design information such as the stratum identifier
is not needed for the user's analysis.

There are two commonly used estimators of the popula- 
tion mean under two phase sampling: the double expansion 


estimator (DEE) and the reweighted expansion estimator 
(REE), named by Kott and Stukel (1997). In general the 
REE is more efficient than the DEE in the situation of two- 
phase sampling for stratification when the y’s within a 
stratum are homogeneous. Variance estimation for two- 
phase sampling is a challenging practical problem, and 
replication variance estimation is of interest among practi- 
tioners. Rao and Shao (1992) proposed a consistent jack- 
knife variance estimator for the REE in the context of hot 
deck imputation treating the respondents as the second- 
phase sample. Kott and Stukel (1997) considered the same 
problem and concluded that the jackknife variance estimator 
works well for the REE if the first-phase sampling rate is 
negligible. The sampling rate, or the sampling fraction,
$f_1 = nN^{-1}$, is called negligible if $f_1$ converges to zero
under the asymptotic setup described in Section 2. Binder,
Babyak, Brodeur, Hidiroglou and Jocelyn (2000) studied
variance estimation for a similar two-phase sample design
using the Taylor linearization method. Kim et al. (2006,
KNF) provided a rigorous investigation of the replication 
method and considered replication for other types of 
estimators. The KNF method has been developed mainly 
under the situation where the first-phase sampling rate is 
negligible and the second-phase sampling is a stratified 
random sampling. If the first-phase sampling rate is not 
negligible, additional replicates are needed to get 
consistent variance estimates. 

In this paper, we propose a new replication method for 
variance estimation under two-phase sampling. The pro- 
posed method is an extension of the KNF method to cover 
the situation where the first-phase sampling rate is not 
necessarily negligible. Unlike the KNF method, the pro- 
posed method does not require additional replicates for bias 
correction in the variance estimation, but does require adjust- 
ments in the replication weights. Also, the proposed method 
is applicable to unequal probability Poisson sampling within 


1. Jae Kwang Kim, Department of Statistics, Iowa State University, Ames, Iowa 50011, U.S.A.; Cindy Long Yu, Department of Statistics, Center for


second-phase strata, which was not discussed in KNF. 
Because the proposed method is a replication-based method, 
it is very easy to implement and can be applied to various 
types of estimators. 

The rest of the paper is organized as follows. In Section 
2, the basic setup is introduced, and in Section 3, the 
proposed method is described. In Section 4, the proposed 
method is extended to other estimators in two-phase 
sampling. In Section 5, results from a limited simulation 
study are presented. Concluding remarks are made in 
Section 6. 


2. Basic setup 


To motivate the method, in this section we assume
that the first phase is a simple random
sample of size $n$ from a finite population of size $N$ and that the
second-phase sample is a stratified random sample. In
Section 3, the setup is extended to include any
measurable sampling design in the first phase and unequal
probability Poisson sampling within each stratum in the
second phase. Using the information obtained from the first-phase
sample, the sample is stratified into $H$ strata for second-phase
sampling. In stratum $h$, we have $n_h$ first-phase sample
elements, and we let $A_{1h}$ be the set of indices for the first-phase
sample elements in stratum $h$. In the second-phase
sampling, a stratified random sample of size $r$ is selected
with sample size $r_h (\le n_h)$ in stratum $h$, where $r = \sum_{h=1}^{H} r_h$,
and the sampling rate $r_h / n_h$ is fixed for each stratum. To
formally discuss the asymptotic theory, we assume a
sequence of finite populations, a sequence of first-phase
samples, and a sequence of second-phase samples, as
described in KNF. In this asymptotic setup, we allow
the second-phase sample size $r$ to go to infinity at the same
rate as the first-phase sample size $n$, i.e., $r = O(n)$ and
$r^{-1} = O(n^{-1})$, and $H$ is fixed. Thus, in the setup of fixed
$H$, $n_h^{-1} = O(n^{-1})$.

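When the study variable $y_i$ is observed in the second-phase
sample, the population mean of $y$ is estimated by

$$\bar{y}_{tp} = \frac{1}{n} \sum_{h=1}^{H} \frac{n_h}{r_h} \sum_{i \in A_{2h}} y_i,$$

where $A_{2h}$ is the set of indices for the second-phase sample
elements that belong to stratum $h$. The variance of $\bar{y}_{tp}$ can
be written as

$$\mathrm{Var}(\bar{y}_{tp}) = \left( \frac{1}{n} - \frac{1}{N} \right) S^2 + E\left\{ \sum_{h=1}^{H} \left( \frac{n_h}{n} \right)^2 \left( \frac{1}{r_h} - \frac{1}{n_h} \right) s_{1h}^2 \right\}, \qquad (1)$$

where $S^2 = (N-1)^{-1} \sum_{i=1}^{N} (y_i - \bar{Y}_N)^2$, $\bar{Y}_N = N^{-1} \sum_{i=1}^{N} y_i$,
$s_{1h}^2 = (n_h - 1)^{-1} \sum_{i \in A_{1h}} (y_i - \bar{y}_{h1})^2$ and $\bar{y}_{h1} = n_h^{-1} \sum_{i \in A_{1h}} y_i$.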
Using 
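$$n^{-1} S^2 \doteq E\left[ n^{-1} \sum_{h=1}^{H} w_h \left\{ (\bar{y}_{h1} - \bar{y}_1)^2 + s_{1h}^2 \right\} \right],$$

where $w_h = n^{-1} n_h$, $\bar{y}_1 = \sum_{h=1}^{H} w_h \bar{y}_{h1}$, and $\doteq$ indicates an approximation
ignoring the terms of order $o(n^{-1})$, the variance term (1) is
approximated by

$$\mathrm{Var}(\bar{y}_{tp}) \doteq E\left\{ n^{-1} (1 - f_1) \sum_{h=1}^{H} w_h (\bar{y}_{h1} - \bar{y}_1)^2 \right\} + E\left\{ n^{-1} \sum_{h=1}^{H} w_h (n_h r_h^{-1} - f_1) s_{1h}^2 \right\}, \qquad (2)$$

where $f_1 = nN^{-1}$.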


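A consistent estimator of the variance of $\bar{y}_{tp}$ can be
derived from (2) by replacing $\bar{y}_{h1}$ and $s_{1h}^2$ by their second-phase analogues
$\bar{y}_{h2} = r_h^{-1} \sum_{i \in A_{2h}} y_i$ and $s_{2h}^2 = (r_h - 1)^{-1} \sum_{i \in A_{2h}} (y_i - \bar{y}_{h2})^2$,
respectively. That is, a consistent variance estimator is

$$\hat{V} = n^{-1} (1 - f_1) \sum_{h=1}^{H} w_h (\bar{y}_{h2} - \bar{y}_{tp})^2 + n^{-1} \sum_{h=1}^{H} w_h (n_h r_h^{-1} - f_1) s_{2h}^2, \qquad (3)$$

where $\bar{y}_{tp} = \sum_{h=1}^{H} w_h \bar{y}_{h2}$. The variance estimator (3) is a
linearized variance estimator.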
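As a rough numerical illustration of the linearized estimator, the sketch below computes the point estimate $\bar{y}_{tp} = \sum_h w_h \bar{y}_{h2}$ and a variance estimate of the form $n^{-1}(1 - f_1) \sum_h w_h (\bar{y}_{h2} - \bar{y}_{tp})^2 + n^{-1} \sum_h w_h (n_h r_h^{-1} - f_1) s_{2h}^2$, which is how (3) is read here. It assumes simple random sampling in the first phase and stratified simple random sampling in the second; the function name and the toy data are hypothetical.

```python
from statistics import mean, variance


def linearized_two_phase(n_h, y2_h, N):
    """Point estimate and linearized variance estimate for two-phase stratification.

    n_h  : first-phase sample counts per stratum
    y2_h : second-phase y-values per stratum (each list has r_h >= 2 values)
    N    : population size
    """
    n = sum(n_h)
    f1 = n / N                                   # first-phase sampling rate
    w = [nh / n for nh in n_h]                   # w_h = n_h / n
    ybar_h2 = [mean(y) for y in y2_h]
    s2_h2 = [variance(y) for y in y2_h]          # divisor r_h - 1
    ybar_tp = sum(wh * yb for wh, yb in zip(w, ybar_h2))
    between = sum(wh * (yb - ybar_tp) ** 2 for wh, yb in zip(w, ybar_h2))
    within = sum(
        wh * (nh / len(y) - f1) * s2
        for wh, nh, y, s2 in zip(w, n_h, y2_h, s2_h2)
    )
    v_hat = (1 - f1) * between / n + within / n
    return ybar_tp, v_hat


# Toy data: two strata, n = 10 first-phase units, N = 100.
ybar_tp, v_hat = linearized_two_phase([6, 4], [[1, 2, 3], [10, 12]], N=100)
```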

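Kott and Stukel (1997) and KNF developed a jackknife
variance estimator by successively deleting units from the
entire first-phase sample and then adjusting the weights. The
full jackknife replicates are

$$\bar{y}_{tp}^{(k)} = \sum_{h=1}^{H} w_h^{(k)} \bar{y}_{h2}^{(k)}, \qquad (4)$$

where $k$ is the index of the unit deleted in the jackknife
replicate,

$$w_h^{(k)} = \begin{cases} (n-1)^{-1} (n_h - 1) & \text{if } k \in A_{1h}, \\ (n-1)^{-1} n_h & \text{if } k \notin A_{1h}, \end{cases}$$

and

$$\bar{y}_{h2}^{(k)} = \begin{cases} (r_h - 1)^{-1} (r_h \bar{y}_{h2} - y_k) & \text{if } k \in A_{2h}, \\ \bar{y}_{h2} & \text{if } k \notin A_{2h}. \end{cases} \qquad (5)$$

The full jackknife variance estimator of the form

$$\hat{V}_J = (1 - f_1) \sum_{k \in A_1} \frac{n-1}{n} \left( \bar{y}_{tp}^{(k)} - \bar{y}_{tp} \right)^2, \qquad (6)$$

where $\bar{y}_{tp}^{(k)}$ is defined in (4), is asymptotically equivalent to

$$\hat{V}_J \doteq n^{-1} (1 - f_1) \sum_{h=1}^{H} w_h (\bar{y}_{h2} - \bar{y}_{tp})^2 + (1 - f_1) \sum_{h=1}^{H} r_h^{-1} w_h^2 s_{2h}^2. \qquad (7)$$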
Thus, comparing (7) with (2), the bias of the jackknife
variance estimator (6) is

$$\mathrm{Bias}(\hat{V}_J) \doteq -E\left\{ f_1 \sum_{h=1}^{H} w_h^2 \left( r_h^{-1} - n_h^{-1} \right) s_{1h}^2 \right\}.$$

Therefore, if the first-phase sampling rate is negligible in the
sense of $f_1 \doteq 0$, the bias is negligible, i.e., the bias is
$o(n^{-1})$. Otherwise, the variance estimator underestimates
the variance.


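To construct a bias-corrected jackknife method, instead of
(5), we consider

$$\bar{y}_{h2}^{(k)} = \begin{cases} (r_h - \delta_h)^{-1} (r_h \bar{y}_{h2} - \delta_h y_k) & \text{if } k \in A_{2h}, \\ \bar{y}_{h2} & \text{if } k \notin A_{2h}, \end{cases} \qquad (8)$$

where $\delta_h$ is to be determined. In (5), $\delta_h = 1$ was used. The
jackknife variance estimator using (8) instead of (5) is
asymptotically equivalent to

$$\hat{V}_J \doteq n^{-1} (1 - f_1) \sum_{h=1}^{H} w_h (\bar{y}_{h2} - \bar{y}_{tp})^2 + (1 - f_1) \sum_{h=1}^{H} w_h^2 \frac{(r_h - 1) \delta_h^2}{(r_h - \delta_h)^2} s_{2h}^2.$$

Thus, the asymptotic bias is

$$\mathrm{Bias}(\hat{V}_J) \doteq E\left[ \sum_{h=1}^{H} w_h^2 \left\{ (1 - f_1) \frac{(r_h - 1) \delta_h^2}{(r_h - \delta_h)^2} - \frac{1}{r_h} \left( 1 - f_1 \frac{r_h}{n_h} \right) \right\} s_{1h}^2 \right].$$

The asymptotic bias is zero if

$$\delta_h = \frac{r_h}{1 + \sqrt{r_h (r_h - 1)} \, d_h^{-1}},$$

where $d_h = \sqrt{(1 - f_1 r_h n_h^{-1}) / (1 - f_1)}$. Hence, with $\delta_h$
so determined in equation (8), the resulting jackknife variance
estimator is approximately unbiased without assuming
$f_1 \doteq 0$.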
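The bias-zero condition can be checked numerically. The sketch below is an illustration under stated assumptions rather than the authors' code: it takes the adjusted replicate to be $(r_h - \delta_h)^{-1}(r_h \bar{y}_{h2} - \delta_h y_k)$, so that the jackknife coefficient on the within-stratum variance is $(1 - f_1)(r_h - 1)\delta_h^2 / (r_h - \delta_h)^2$, and compares it with the target coefficient $r_h^{-1} - f_1 n_h^{-1}$ (both per unit of $w_h^2$). The function names and the values $r_h = 25$, $n_h = 50$, $f_1 = 0.2$ are chosen for illustration.

```python
import math


def delta_h(r_h, n_h, f1):
    """Adjustment factor delta_h = r_h / (1 + sqrt(r_h (r_h - 1)) / d_h)."""
    d_h = math.sqrt((1 - f1 * r_h / n_h) / (1 - f1))
    return r_h / (1 + math.sqrt(r_h * (r_h - 1)) / d_h)


def jackknife_coefficient(r_h, f1, delta):
    """Coefficient of w_h^2 s_h^2 produced by the jackknife with factor delta."""
    return (1 - f1) * (r_h - 1) * delta ** 2 / (r_h - delta) ** 2


def target_coefficient(r_h, n_h, f1):
    """Coefficient of w_h^2 s_h^2 in the variance being estimated."""
    return 1 / r_h - f1 / n_h


r_h, n_h, f1 = 25, 50, 0.2
delta = delta_h(r_h, n_h, f1)
corrected = jackknife_coefficient(r_h, f1, delta)
uncorrected = jackknife_coefficient(r_h, f1, 1.0)   # the usual delete-one choice
target = target_coefficient(r_h, n_h, f1)
```

With $\delta_h = 1$ the coefficient falls short of the target, which is the underestimation described above; with the adjusted $\delta_h$ the two coefficients agree.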


3. Proposed method 


The bias-corrected method of Section 2 is now extended to a
more general first-phase sampling design. To do this, we
need to assume that the replication variance estimator of the
form

$$\hat{V}_1 = \sum_{k=1}^{L} c_k \left( \hat{\theta}^{(k)} - \hat{\theta} \right)^2,$$

where $\hat{\theta} = \sum_{i \in A_1} w_i y_i$ and $\hat{\theta}^{(k)} = \sum_{i \in A_1} w_i^{(k)} y_i$, is consistent
for the variance of $\hat{\theta}$ under the single (first) stage sampling
design. That is,

$$\frac{\hat{V}_1}{\mathrm{Var}(\hat{\theta})} = 1 + o_p(1). \qquad (9)$$

Here $L$ is the number of replicates. For most
measurable designs, which are designs with all positive joint
inclusion probabilities, we can construct a replication
variance estimator satisfying (9) even when the sampling rate
$f_1 = nN^{-1}$ is large. For example, see Fay (1984) and Flyer
(1987). Brick and Morganstein (1996) describe the basic
algorithm for WesVar, a commercially available software package
for replication variance estimation in survey sampling.

In this section, we also consider a more challenging case 
of stratified unequal probability sampling for the second 
phase. More specifically, the second phase sampling 
considered is unequal probability Poisson sampling within 
the second-phase strata. Fuller (1998) also considered 
Poisson sampling in the second phase and argued that 
Poisson sampling in the second phase sampling is a good 
approximation. An example of this in the context of forest 
surveys is that, in addition to forest types, the photo- 
interpreters can also identify tree density and tree height 
from the aerial photos taken in the first phase, which can be 
used to construct the second phase selection probabilities 
within each stratum (forest type). 

In this section, we focus on the REE-type estimator
first since it is more efficient than the DEE-type estimator; the
extension to the DEE is discussed in Section 4. Let $w_i$ be
the first-phase sampling weight and let $w_{i2}$ be the inverse of
the conditional selection probability in the second phase. That is,
$w_{i2} = \pi_{i2}^{-1}$, where $\pi_{i2} = \Pr(i \in A_2 \mid i \in A_1)$. The REE-type
estimator can be written as

$$\bar{y}_{tp} = \frac{1}{N} \sum_{h=1}^{H} \hat{N}_{1h} \bar{y}_{h2}, \qquad (10)$$

where $\hat{N}_{1h} = \sum_{i \in A_{1h}} w_i$ and $\bar{y}_{h2} = (\sum_{i \in A_{2h}} w_i \pi_{i2}^{-1})^{-1} \sum_{i \in A_{2h}} w_i \pi_{i2}^{-1} y_i$.
In KNF, $\pi_{i2}$ is assumed to be constant within the second-phase
stratum.

We consider a replication-based approach for variance
estimation of the REE-type estimator (10) when $\pi_{i2}$ is not
necessarily constant within the second-phase stratum. We
consider the special case when the second-phase sampling
design is Poisson sampling. Using the replication method
satisfying (9), the KNF-type variance estimator can be
applied to estimate the variance of $\bar{y}_{tp}$ in this situation. That
is,

$$\hat{V}_{KNF} = \sum_{k=1}^{L} c_k \left( \bar{y}_{tp}^{(k)} - \bar{y}_{tp} \right)^2, \qquad (11)$$

where

$$\bar{y}_{tp}^{(k)} = \frac{1}{N} \sum_{h=1}^{H} \hat{N}_{1h}^{(k)} \bar{y}_{h2}^{(k)}, \qquad (12)$$

with $\bar{y}_{h2}^{(k)} = (\sum_{i \in A_{2h}} w_i^{(k)} \pi_{i2}^{-1})^{-1} \sum_{i \in A_{2h}} w_i^{(k)} \pi_{i2}^{-1} y_i$ and $\hat{N}_{1h}^{(k)} = \sum_{i \in A_{1h}} w_i^{(k)}$, and $c_k$ is a factor associated with replicate $k$
determined by the replication method. Under Poisson
sampling in the second phase, we have the following
asymptotic bias:

$$\mathrm{Bias}(\hat{V}_{KNF}) \doteq -\frac{1}{N^2} \sum_{h=1}^{H} \sum_{i \in U_h} \pi_{i2}^{-1} (1 - \pi_{i2}) (y_i - \bar{Y}_h)^2, \qquad (13)$$

where $U_h$ is the set of indices of population elements in
stratum $h$ and $\bar{Y}_h = N_h^{-1} \sum_{i \in U_h} y_i$. A sketched proof of (13)
is presented in Appendix A.


An asymptotically unbiased estimator of the bias (13) is

$$\hat{V}_{bias} = -\frac{1}{N^2} \sum_{h=1}^{H} \sum_{i \in A_{2h}} w_i \pi_{i2}^{-2} (1 - \pi_{i2}) (y_i - \bar{y}_{h2})^2. \qquad (14)$$

The bias is negligible if $n/N \doteq 0$. Thus, we can safely
ignore the bias of the KNF-type variance estimator when the
first-phase sampling rate is negligible. The bias can be
arbitrarily large if the first-phase sampling rate $n/N$ is not
negligible. KNF also discuss a bias-correction replication
method using additional replicates, which can lead to a large
number of replicates. Creating additional replicates for bias
correction can be cumbersome for large-scale surveys.

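We consider an alternative bias-corrected replication
variance estimator that does not require creating additional
replicates. To develop a replication-based bias-corrected
variance estimator, define a random variable

$$\delta_{ki} \overset{indep}{\sim} \mathrm{Bernoulli}(p_k), \qquad (15)$$

where $p_k$ is to be determined. Let

$$\hat{V}_{tp}^{*} = \sum_{k=1}^{L} c_k \left( \bar{y}_{tp}^{*(k)} - \bar{y}_{tp} \right)^2, \qquad (16)$$

where

$$\bar{y}_{tp}^{*(k)} = \frac{1}{N} \sum_{h=1}^{H} \hat{N}_{1h}^{(k)} \bar{y}_{h2}^{*(k)}, \qquad (17)$$

with $\hat{N}_{1h}^{(k)} = \sum_{i \in A_{1h}} w_i^{(k)}$,

$$\bar{y}_{h2}^{*(k)} = \frac{\sum_{i \in A_{2h}} w_i^{(k)} M_{i2}^{(k)} \pi_{i2}^{-1} y_i}{\sum_{i \in A_{2h}} w_i^{(k)} M_{i2}^{(k)} \pi_{i2}^{-1}}, \qquad (18)$$

with

$$M_{i2}^{(k)} = 1 + b_i (\delta_{ki} - p_k), \qquad (19)$$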
and $b_i$ is also to be determined. By construction,
$E_*(\delta_{ki} - p_k) = 0$, where $E_*$ denotes that the expectation
is taken with respect to the mechanism in (15). Thus, the
replicates (18) create additional variation in the replication
weights, where the additional variation in (18) comes from
the distribution (15). A suitable choice of $p_k$ and $b_i$ can
make the resulting variance estimator consistent.

Under the regularity conditions discussed in KNF, we
have

$$E_*(\hat{V}_{tp}^{*}) = \hat{V}_{KNF} + N^{-2} \psi \sum_{h=1}^{H} \sum_{i \in A_{2h}} w_i^2 b_i^2 \pi_{i2}^{-2} (y_i - \bar{y}_{h2})^2 + o_p(n^{-1}), \qquad (20)$$

where $\psi = \sum_{k=1}^{L} c_k p_k (1 - p_k)$. A sketched proof of (20) is
presented in Appendix B. If the $b_i$ are determined by

$$b_i^2 = \psi^{-1} (1 - \pi_{i2}) w_i^{-1}, \qquad (21)$$

the variance estimator (16) is consistent because the second
term in (20) cancels out $\hat{V}_{bias}$ in (14). This is true even when
the first-phase sampling rate $n/N$ is not negligible. To
guarantee nonnegative replication weights in (18), we
require that $b_i$ in (19) is $\le 1$. If we set $p_k = 0.5$, then

$$b_i^2 = \frac{4 (1 - \pi_{i2}) w_i^{-1}}{\sum_{k=1}^{L} c_k},$$

which is less than or equal to 1 if $\sum_{k=1}^{L} c_k \ge 4$. In fact, the
$p_k$'s can be chosen to be any number between 0 and 1 as
long as the resulting $b_i$ in (21) is less than or equal to 1.
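A small sketch may help to see how the perturbation is generated in practice. It assumes that $b_i^2 = (1 - \pi_{i2}) / (\psi w_i)$ with $\psi = \sum_k c_k p_k (1 - p_k)$ and a common $p_k = p$, and that the replication factor is $M_{i2}^{(k)} = 1 + b_i (\delta_{ki} - p)$ with $\delta_{ki} \sim \mathrm{Bernoulli}(p)$. The function names, the delete-one jackknife factors $c_k = (n-1)/n$, and the values of $w_i$ and $\pi_{i2}$ are illustrative assumptions.

```python
import math
import random


def perturbation_scale(w_i, pi_i2, c, p=0.5):
    """Return b_i, where b_i**2 = (1 - pi_i2) / (psi * w_i)."""
    psi = sum(c_k * p * (1.0 - p) for c_k in c)
    return math.sqrt((1.0 - pi_i2) / (psi * w_i))


def replication_factors(b_i, L, p=0.5, rng=None):
    """Draw M_i2^(k) = 1 + b_i * (delta_ki - p) for k = 1, ..., L."""
    rng = rng or random.Random()
    factors = []
    for _ in range(L):
        delta = 1.0 if rng.random() < p else 0.0   # Bernoulli(p) draw
        factors.append(1.0 + b_i * (delta - p))
    return factors


n = 200
c = [(n - 1) / n] * n                 # illustrative jackknife factors c_k
b = perturbation_scale(w_i=5.0, pi_i2=0.5, c=c)
M = replication_factors(b, L=n, rng=random.Random(1))
```

Because $b_i \le 1$ here, every factor lies between $1 - b_i/2$ and $1 + b_i/2$, so the perturbed replication weights stay nonnegative.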


4. Extensions 


In this section, we consider some extensions of the 
proposed replication method to types of two-phase esti- 
mators other than the REE in (10). 


4.1 Double expansion estimator 


In two-phase sampling, the double expansion estimator (DEE),
so named by Kott and Stukel (1997), is also used. The
DEE has the simple form

$$\bar{y}_{DEE} = \frac{1}{N} \sum_{h=1}^{H} \sum_{i \in A_{2h}} w_i \pi_{i2}^{-1} y_i. \qquad (22)$$

When the second-phase sample is a stratified random
sample, $\pi_{i2} = r_h / n_h$ and the KNF method can be applied
using the replicate
H (*) 
oe th SI pire cH apell »: wy 
YDEE — N 4) 4 
hel Pahek! Mi fe Ay, 


The KNF variance estimator for DEE is consistent when the 
first-phase sampling rate is negligible. When the first-phase 
sampling rate is not negligible, we can use the replication 
method proposed in Section 3. The proposed replication 
method for the DEE creates replicates, 
Vous = — +> et wis ys (23)


=1 ie€A,, 


where 


(CS! 
) 
De Any Mg ae 
(k) (k) 
ye Ano ey Ww, i 


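and $M_{i2}^{(k)}$ is the replication factor defined in (19). The bias
of the replication variance estimator using replicate (23) is
negligible if the replicates are constructed to satisfy (21).

If the second-phase sample is an unequal probability
sample within each stratum, a replication method such as
(23) is not directly applicable. The DEE in (22) is generally
less efficient than the REE in (10). Note that the REE in
(10) can also be expressed as

$$\bar{y}_{tp} = \frac{1}{N} \sum_{h=1}^{H} \sum_{i \in A_{2h}} w_i w_{i2}^{*} y_i, \qquad (24)$$

where

$$w_{i2}^{*} = \frac{\sum_{j \in A_{1h}} w_j}{\sum_{j \in A_{2h}} w_j \pi_{j2}^{-1}} \pi_{i2}^{-1}, \qquad i \in A_{2h}. \qquad (25)$$

The replicates (17) can be written

$$\bar{y}_{tp}^{*(k)} = \frac{1}{N} \sum_{h=1}^{H} \sum_{i \in A_{2h}} w_i^{(k)} w_{i2}^{*(k)} y_i, \qquad (26)$$

where

$$w_{i2}^{*(k)} = \frac{\sum_{j \in A_{1h}} w_j^{(k)}}{\sum_{j \in A_{2h}} w_j^{(k)} M_{j2}^{(k)} \pi_{j2}^{-1}} M_{i2}^{(k)} \pi_{i2}^{-1}, \qquad i \in A_{2h}, \qquad (27)$$

and $M_{i2}^{(k)}$ is defined in (19).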


4.2 Regression estimator 


In two-phase sampling, auxiliary variables that are
observed in the first-phase sample can be further used at the
estimation stage. The two-phase regression estimator of the
population total can be written in the form

$$\hat{Y}_{REG} = \hat{T}_{x1}' \hat{\beta}_2, \qquad (28)$$

where $\hat{T}_{x1} = \sum_{i \in A_1} w_i x_i$ is the vector of estimated population
totals of the control variables $x$ computed with the first-phase
sample, $\hat{\beta}_2 = (\sum_{i \in A_2} w_i w_{i2}^{*} x_i x_i')^{-1} \sum_{i \in A_2} w_i w_{i2}^{*} x_i y_i$ is the
vector of regression coefficients estimated with
the second-phase sample, and $w_{i2}^{*}$ is given by (25). Note that
the regression estimator in (28) can incorporate the stratified
sampling design in the second phase if $x_i$ includes the
vector of stratum indicators.

Using the arguments of Section 3, the $k$-th replicate for
$\hat{Y}_{REG}$ can be constructed by

$$\hat{Y}_{REG}^{(k)} = \hat{T}_{x1}^{(k)\prime} \hat{\beta}_2^{(k)}, \qquad (29)$$

where

$$\hat{T}_{x1}^{(k)} = \sum_{i \in A_1} w_i^{(k)} x_i, \qquad \hat{\beta}_2^{(k)} = \left( \sum_{i \in A_2} w_i^{(k)} w_{i2}^{*(k)} x_i x_i' \right)^{-1} \sum_{i \in A_2} w_i^{(k)} w_{i2}^{*(k)} x_i y_i,$$

and $w_{i2}^{*(k)}$ is defined in (27).

The replication method (29) is directly applicable to
the two-phase calibration estimator discussed in
Hidiroglou and Särndal (1998). If $H = 1$, then the replicate
of $\hat{\beta}_2$ in (29) reduces to

$$\hat{\beta}_2^{(k)} = \left( \sum_{i \in A_2} w_i^{(k)} M_{i2}^{(k)} \pi_{i2}^{-1} x_i x_i' \right)^{-1} \sum_{i \in A_2} w_i^{(k)} M_{i2}^{(k)} \pi_{i2}^{-1} x_i y_i.$$


5. Simulation study 


To study the finite sample performance of the proposed
estimators, we conducted a limited simulation study. In the
simulation, we first generated an artificial finite population
of size $N = 1{,}000$ with five variables $(z_i, q_i, x_i, u_i, y_i)$,
where the population elements are independently generated
from $z_i \sim \exp(1) + 2$; $q_i \sim \chi^2(1) + 2$; $x_i \sim N(2, 1)$; $u_i \sim$
Unif$\{1, 2, 3, 4\}$, where Unif$\{1, \ldots, G\}$ denotes a discrete
uniform distribution with support $\{1, \ldots, G\}$, and

Vi= By eBoy Boze Oh Fe, 


Witllan( Bee e555 )e = (Up en lel) eand e = N(0, 1). “The 
variables $z_i$, $q_i$, $x_i$, $u_i$, and $e_i$ are mutually independent.
The stratum for the second-phase sampling was defined
using variable $u_i$. Variable $x_i$ was used to compute the
two-phase regression estimator (28) with $\mathbf{x}_i = (1, x_i)'$,
variable $z_i$ was used as a size measure for the unequal
probability sampling in the first phase, and
variable $q_i$ was used as a size measure for the unequal
probability sampling in the second phase.

To obtain unequal probability samples for this simulation 
study, we used either Poisson sampling or Rao-Sampford 
sampling (Rao 1965 and Sampford 1967), with selection 
probabilities proportional to the measure of the size 
variable. Note that the final sample size is random under 
Poisson sampling but is fixed under Rao-Sampford 
sampling. 
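The random sample size under Poisson sampling follows directly from its mechanism: each unit is selected independently with its own probability. The following sketch is a generic illustration, not the simulation code of the paper; the function name is hypothetical, and probabilities are taken proportional to a size measure and scaled so that they sum to the expected sample size (capped at 1).

```python
import random


def poisson_sample(sizes, expected_n, rng):
    """Poisson sampling with selection probabilities proportional to size.

    Returns the selected indices and the selection probabilities.
    Each unit is included independently, so the realized sample size is random.
    """
    total = sum(sizes)
    probs = [min(1.0, expected_n * s / total) for s in sizes]
    sample = [i for i, p in enumerate(probs) if rng.random() < p]
    return sample, probs


# Illustrative size measures taking the values 2, 3, 4, 5 for 50 units.
sizes = [2 + (i % 4) for i in range(50)]
sample, probs = poisson_sample(sizes, expected_n=25, rng=random.Random(2011))

# Degenerate case: probabilities equal to 1 select every unit.
everyone, ones = poisson_sample([1, 1], expected_n=2, rng=random.Random(0))
```

A fixed-size design such as Rao-Sampford sampling needs a more elaborate selection scheme and is not sketched here.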

The simulation setup employed a $2 \times 3 \times 2$ factorial
structure with three factors. The factors are

1. Sampling for the first-phase sample (2): Simple random
sampling of size $n = 200$ versus Rao-Sampford
sampling of size $n = 200$ using $z_i$ as the measure of
size.

2. Sampling for the second-phase sample (3): Stratified
random sampling of size $r_h = 25$, stratified Poisson
sampling with expected sample size $r_h = 25$ using $q_i$
as the size measure for the unequal probability sampling,
and stratified Rao-Sampford sampling of size
$r_h = 25$ using $q_i$ as the size measure for the unequal
probability sampling.

3. Variance estimation methods (2): The KNF estimator
(11) without additional replicates versus the proposed
variance estimator (16), both computed with
the jackknife method.


From the finite population generated above, we gener- 
ated B=5,000 independent Monte Carlo samples for 
simulation. For the designs with Rao-Sampford sampling in 
the first phase, we used the jackknife variance estimation 
method proposed by Berger (2007), which gives a 
consistent estimator of the first phase sampling variance. 
The parameter of interest is the population mean of the y 
variable. From each Monte Carlo sample, we computed two 
point estimators, the REE in (24) and the regression
estimator (REG) in (28) using the auxiliary variable $(1, x_i)$.
Relative biases of the variance estimators were computed by 
dividing the Monte Carlo bias of the variance estimator by 
the Monte Carlo variance of the point estimator. 

Table 1 shows the mean and variance of the two point
estimators. For point estimation, the regression estimator is
considerably more efficient than the REE for this population
because the auxiliary variable $x$ is correlated with the study
variable $y$. The theoretical asymptotic variance of the
regression estimator under simple random sampling in the
first phase and stratified random sampling in the second
phase is approximately equal to

$$\left( \frac{1}{200} - \frac{1}{1{,}000} \right) 8 + \left( \frac{1}{100} - \frac{1}{200} \right) 4 = 0.052$$

and the theoretical asymptotic variance of the REE under
the same design is, approximately, $(1/100 - 1/1{,}000) \times 8 = 0.072$,
which is consistent with the numerical results in
Table 1. Rao-Sampford sampling in the second phase is
slightly more efficient than Poisson sampling because of
the fixed sample size under Rao-Sampford sampling.
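The two approximate variances quoted above can be reproduced with a few lines of arithmetic. The sketch assumes $n = 200$, $N = 1{,}000$, a total second-phase size $r = 100$, a variance of $y$ of about 8 and a residual variance of about 4 after regressing on $x$; the value 4 is inferred here from the stated total of 0.052, and the function names are hypothetical.

```python
def approx_var_reg(n, N, r, var_y, var_resid):
    """First-phase term on var_y plus second-phase term on the residual variance."""
    return (1 / n - 1 / N) * var_y + (1 / r - 1 / n) * var_resid


def approx_var_ree(N, r, var_y):
    """Approximate variance of the REE when strata do not explain y beyond chance."""
    return (1 / r - 1 / N) * var_y


v_reg = approx_var_reg(n=200, N=1000, r=100, var_y=8.0, var_resid=4.0)
v_ree = approx_var_ree(N=1000, r=100, var_y=8.0)
```

Both values are close to the Monte Carlo variances reported for the design with simple random sampling in the first phase and stratified random sampling in the second.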

Table 2 shows the relative bias (RB) and coefficient of 
variation (CV) of the two variance estimators. Relative 
biases of the variance estimators were computed by dividing 
the Monte Carlo bias of the variance estimator by the Monte 
Carlo variance of the point estimator. Coefficients of varia- 
tion of the variance estimator were computed by dividing 
the Monte Carlo standard error of the variance estimator by 
the Monte Carlo average of the variance estimator. 
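The two summary measures can be written down in a few lines. This is a sketch of the definitions just given, with hypothetical function names and toy inputs; it uses sample variances and standard deviations with divisor $B - 1$, which is an assumption, since the text does not say which divisor was used.

```python
from statistics import mean, stdev, variance


def relative_bias(var_estimates, point_estimates):
    """Monte Carlo bias of the variance estimator over the Monte Carlo variance."""
    mc_variance = variance(point_estimates)
    return (mean(var_estimates) - mc_variance) / mc_variance


def coefficient_of_variation(var_estimates):
    """Monte Carlo standard error of the variance estimator over its average."""
    return stdev(var_estimates) / mean(var_estimates)


# Toy Monte Carlo output from three samples, for illustration only.
point_estimates = [9.0, 10.0, 11.0]
var_estimates = [0.8, 1.0, 0.9]
rb = relative_bias(var_estimates, point_estimates)
cv = coefficient_of_variation(var_estimates)
```

In the toy example the variance estimator averages 0.9 against a Monte Carlo variance of 1, a relative bias of about -10%.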
Table 1
Mean and variance of the point estimators (5,000 samples)

Estimator  First-phase  Second-phase  Mean  Variance
           sampling     sampling
REE        SRS          St. SRS       10.0  0.0749
                        St. Poi       10.0  0.0784
                        St. RS        10.0  0.0754
           RS           St. SRS       10.0  0.0768
                        St. Poi       10.0  0.0827
                        St. RS        10.0  0.0781
REG        SRS          St. SRS       10.0  0.0540
                        St. Poi       10.0  0.0510
                        St. RS        10.0  0.0495
           RS           St. SRS       10.0  0.0551
                        St. Poi       10.0  0.0531
                        St. RS        10.0  0.0515

REE: reweighted expansion estimator (24),
REG: regression estimator (28),
SRS: Simple random sampling,
RS: Rao-Sampford sampling,
St. SRS: Stratified simple random sampling,
St. Poi: Stratified Poisson sampling,
St. RS: Stratified Rao-Sampford sampling.


Table 2
Relative bias (RB) and coefficient of variation (CV) for the
variance estimators (5,000 samples)

Method  Estimator  First-phase  Second-phase  RB (%)  CV (%)
                   sampling     sampling
KNF     REE        SRS          St. SRS       =e25e   8:22
                                St. Poi       -9.56   18.67
                                St. RS        -7.75   15.35
                   RS           St. SRS       -8.05   18.61
                                St. Poi       -9.03   20.84
                                St. RS        -5.73   VT
        REG        SRS          St. SRS       -6.76   22.32
                                St. Poi       -6.06   15.81
                                St. RS        -3.26   L2ES2
                   RS           St. SRS       -4.17   21.74
                                St. Poi       -3.64   16.92
                                St. RS        -3.20   13.78
New     REE        SRS          St. SRS       0.09    18.23
                                St. Poi       -1.23   19.70
                                St. RS        -0.04   16.06
                   RS           St. SRS       0.78    19.78
                                St. Poi       -2.07   21.26
                                St. RS        1.00    17.67
        REG        SRS          St. SRS       -0.61   22.00
                                St. Poi       -0.57   16.55
                                St. RS        -0.08   13.36
                   RS           St. SRS       0.67    22.86
                                St. Poi       -0.01   16.97
                                St. RS        0.59    14.02

KNF: Kim et al. (2006) variance estimator without additional
replicates for bias correction,
New: the proposed variance estimator (16),
REE: reweighted expansion estimator (24),
REG: regression estimator (28),
SRS: Simple random sampling,
RS: Rao-Sampford sampling,
St. SRS: Stratified simple random sampling,
St. Poi: Stratified Poisson sampling,
St. RS: Stratified Rao-Sampford sampling.
In this simulation, because the first-phase sampling
fraction is not negligible ($n/N = 0.2$), the KNF variance
estimator without additional replicates underestimates the
true variance, whereas the proposed variance estimator estimates
the variance with smaller bias, less than 3% in absolute
value in all cases, which is consistent with the theory in
Section 3 and Section 4. The relative
biases of the KNF variance estimator are large in absolute value because,
although in (29) the variance due to $\hat{T}_{x1}^{(k)}$ is consistently
estimated, the variance due to $\hat{\beta}_2^{(k)}$ is underestimated without
additional replicates. The relative biases of our proposed
variance estimator are reduced because the replicates (18) create
additional variation in the replication weights through the additional
perturbation $\delta_{ki}$ drawn from a properly chosen distribution.
The proposed variance estimator shows slightly
larger CVs than the KNF method because it involves extra
randomness due to generating $\delta_{ki}$ from (15).


6. Concluding remarks 


Replication variance estimation under two-phase sampling
is an important practical problem in survey sampling,
and the KNF method is a useful tool in this direction. In this
article, we propose an extension of the KNF method that
is directly applicable when the first-phase sampling
rate is non-negligible, without increasing the number of
replicates. The proposed method is also applicable to
unequal probability Poisson sampling within each stratum in 
the second-phase sample. Although the theory has been 
developed only under Poisson sampling in the second phase, 
the simulation results in Section 5 show that the proposed
method works reasonably well for other unequal probability 
sampling designs, such as the Rao-Sampford sampling 
design. Since the proposed replication method provides 
consistent variance estimators for population means, it can 
be readily applied to other finite population parameters 
which are smooth functions of population means. 

In some large scale surveys, the number of replicates can 
be quite large because the method uses the same number of 
replicates as for the first-phase sample. If one wishes to 
reduce the number of replicates further, the method of 
Fuller (1998) or Kim and Sitter (2003) can be considered. 
Further investigation in this direction will be a topic of 
future study. 

Appendix 


A. Proof of (13) 


Let $a = (a_1, \ldots, a_N)$ where $a_i$ is the extended version of 
the second-phase sampling indicator as discussed in Kim 
et al. (2006). That is, $a_i = 1$ if unit $i$ is selected for the 
second-phase sample once it is in the first-phase sample and 
$a_i = 0$ otherwise. 

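By assumption (9), conditional on $a$, we have 

$$\sum_{k=1}^{L} c_k \left(\bar{y}_{tp}^{(k)} - \bar{y}_{tp}\right)^2 = \mathrm{Var}\left(\bar{y}_{tp} \mid a\right) + o_p(n^{-1}).$$

Thus, the bias of $\sum_{k=1}^{L} c_k (\bar{y}_{tp}^{(k)} - \bar{y}_{tp})^2$ as an estimator for 
$\mathrm{Var}(\bar{y}_{tp})$ is then equal to, ignoring $o(n^{-1})$ terms, 

$$E\{\mathrm{Var}(\bar{y}_{tp} \mid a)\} - \mathrm{Var}(\bar{y}_{tp}) = -\mathrm{Var}\{E(\bar{y}_{tp} \mid a)\}.$$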
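
The bias expression above is an instance of the law of total variance. A short derivation is sketched below for a generic estimator $\hat\theta$, which is a placeholder symbol introduced here for illustration and not the notation of the paper:

```latex
\begin{align*}
\mathrm{Var}(\hat\theta)
  &= E\{\mathrm{Var}(\hat\theta \mid a)\} + \mathrm{Var}\{E(\hat\theta \mid a)\} \\
\Longrightarrow\quad
E\{\mathrm{Var}(\hat\theta \mid a)\} - \mathrm{Var}(\hat\theta)
  &= -\mathrm{Var}\{E(\hat\theta \mid a)\}.
\end{align*}
```

Read this way, a replication estimator that is consistent only for the conditional variance given $a$ has a negative bias equal in size to the variance of the conditional mean.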


Using the extended definition of a,, we have 
=1 
a Tin q; MV 
= 
Dane Tl q; 


and, by the Poisson sampling assumption of a, ’s, 


E(¥,2| a) = 


Me a erent pa = 


we > 2 - ty), -¥,)% + 0(N). 


j I 
ieU, 


(A.1) 


Thus, the bias of the KNF variance estimator is of the form 
(13) under the Poisson sampling assumption of a. 


B. Proof of (20) 


For each k, 


SO) 5 oth) _ =, o) _ 
Vip — ips >) Vip ~ Vip a Vip Vip» 


where Be is defined in (12). Thus, 


*(k) _ —(k) 
Vip Ree — Vip ) 


HE 
=(k) 
+ my Cy om — 
k=l 


ak eo aie ve Des (B.1) 
By the construction of 9,., we have 
AY, = ye OT), (B.2) 


Also, writing q,, = Mj) -1, we have q,, = On) 
and we can apply a Taylor expansion to get 
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(a4 peu) 
—*(k) _ <(k) evs We Tir Vis Vi — Vir) 


Vig. a pare (os (B.3) 
Dan Wy Tp 


+0,(1'). 


Also, because 


l 1 
(kb) 1, ree ob -1 
= > nt ee ae ps W, 1, Z, = O(n’) 


h ie Ay> h ieA,, 


for any z variable with bounded fourth moments, it can be 
shown that (B.3) reduces to 


z = 
ee W;Ti2 Vi (Vi — Vaz) e 
Sa Sa Oe). 


1 
ee W; Tip 


Hence, we can write 


L > 
=) te 
DUGhOs San) tp ) a 
k= 
= 


L H i 
Safed Dy W,TE:5 Jy; (V; — 2) + ot): (B.4) 
k=l 


h=1 ie A,» 
Inserting (B.2) and (B.4) into (B.1), we have 
Bi Wexe) = Vise 


H 


] L a) 2 = 9! 
+a Diee DDH ECG) Ra (Ys Tra) 
k=l 


c= h=1 ie A,> 
“| 
Ot), 


and because E.(q@) =p, (hi pe) bs. we have (20). 

Cost efficiency of repeated cluster surveys 


Stanislav Kolenikov and Gustavo Angeles 


Abstract 


We analyze the statistical and economic efficiency of different designs of cluster surveys collected in two consecutive time 
periods, or waves. In an independent design, two cluster samples in two waves are taken independently from one another. In 
a cluster-panel design, the same clusters are used in both waves, but samples within clusters are taken independently in two 
time periods. In an observation-panel design, both clusters and observations are retained from one wave of data collection to 
another. By assuming a simple population structure, we derive design variances and costs of the surveys conducted 
according to these designs. We first consider a situation in which the interest lies in estimation of the change in the 
population mean between two time periods, and derive the optimal sample allocations for the three designs of interest. We 
then propose the utility maximization framework borrowed from microeconomics to illustrate a possible approach to the 
choice of the design that strives to optimize several variances simultaneously. Incorporating the contemporaneous means 
and their variances tends to shift the preferences from observation-panel towards simpler cluster-panel and independent 
designs if the panel mode of data collection is too expensive. We present numeric illustrations demonstrating how a survey 
designer may want to choose the efficient design given the population parameters and data collection cost. 


Key Words: Longitudinal study; Cluster samples; DHS; NHIS. 


1. Introduction 


To analyze the dynamics of social, behavioral or popu- 
lation health phenomena, researchers and policymakers 
need to obtain information on characteristics of the 
population on multiple occasions. Complex design surveys 
are the most frequently used sources of information for large 
populations, such as a country as a whole. Besides the 
standard considerations in single-shot surveys, e.g., stratifi- 
cation and clustering, other issues may be important in 
surveys collected over two or more time periods. In such 
surveys, the total cost and the total survey error are affected 
by an overlap among consecutive samples, (informative) 
sample attrition, time-in-sample or conditioning effects, and 
other dynamic factors. 

For the purposes of estimation of change from repeated 
surveys, it is often desirable to have high temporal corre- 
lation of the observation units which can be achieved by 
administering the survey to the same sampling and/or 
observation units. In longitudinal surveys, the same observation 
units (individuals, households) are revisited for 
several periods, potentially indefinitely many periods (e.g., the 
US Panel Study of Income Dynamics (PSID), the British 
Household Panel Study (BHPS) and others; a compendium 
of information on the longitudinal studies can be found 
at the Institute for Social and Economic Research web site, 
http://iser.essex.ac.uk/ulsc/keeptrack/index.php). In rotating 
panel surveys, the observation units are recruited into the 
sample for a few periods, then rotated out of the sample, and 
surveyed again at a later time. Examples of rotating panel 
surveys include the US Current Population Survey (CPS) 
(Binder and Hidiroglou 1988, Eckler 1955, Rao and 
Graham 1964) and a number of environmental surveys 
(Fuller 1999, McDonald 2003, Scott 1998). Yet another 
option is to use the same primary sampling units (PSUs) in 
different waves, but sample the observation units (secondary 
sampling units, SSUs) independently. Surveys collected in 
this way include international Demographic and Health 
Surveys (DHS) and the US National Health Interview 
Survey (NHIS). 

We shall concentrate on surveys collected in two time 
periods, or waves, using a two-stage cluster design in each 
wave of data collection. We consider three possible designs 
differing in the amount and depth of overlap of sampling 
units over time. The sample designer can simply ignore any 
possible effects arising from the sample overlap, and take 
two independent samples in two periods of time. We shall 
refer to this design as the independent design. Alternatively, 
the sample designer may find it beneficial to recycle the 
PSUs from one wave to another. If the designer finds it 
difficult to track the SSUs from one wave to another, the 
subsamples within clusters can be taken independently in 
two waves of data collection. We shall refer to this design as 
the cluster-panel design. If utmost precision is essential, 
the fully longitudinal design will attempt to locate all 
individuals who responded in the first wave, and solicit the 
second interview. To distinguish this design from the 
cluster-panel design, we shall refer to it as the 
observation-panel design. 

A particular aspect that we found important in survey 
management, but underaddressed in the existing literature, is 
the implementation cost (Groves 1989). The traditional cost 
models such as those used in derivation of Neyman- 
Tchuprow optimal allocation design (Neyman 1938) can be 
extended to include terms related to the cost of the first visit 
to the cluster and ultimate observation unit, as well as the 
cost of consecutive visits. The cost of revisiting the cluster is 
likely to be lower on the second occasion. There is no need 
to create new maps and set up frames. The same interview- 
ers can be used to conduct interviews in subsequent waves 
of data collection. Cooperation with community leaders has 
been established earlier, if it is important, as it is in some 
traditional societies. The effect of the panel mode of data 
collection at the individual level is less clear. If the 
household that was interviewed in earlier waves moved out 
and would have to be located, possibly in different geo- 
graphic area, the (average) cost of the panel interview goes 
up. The likelihood of such circumstances increases with 
the longer intervals between waves that are typical of 
surveys in developing countries: the intervals between waves of DHS are 
usually about 5-7 years. On the other hand, if a less 
expensive interview mode can be used after the first round 
(e.g., a phone interview instead of the personal visit), the 
cost of the panel interview goes down. 

This paper brings together statistical and economic 
considerations in the choice of the appropriate design and its 
parameters. We assume the survey designer may be interested 
in estimating the change in the population mean 
between two time periods, and/or the means themselves. We 
introduce a simple population structure in Section 2, and compute 
the design variances of the means and their differences for 
the three sampling designs of interest. 

To incorporate economic aspects of data collection, we 
introduce a relatively simple cost model for a repeated 
cluster survey in Section 3. We set up and solve opti- 
mization problems to obtain the optimal sample sizes for the 
three considered designs. By plugging in the estimates of 
the statistical parameters (variances and autocorrelations) 
and cost components (cluster-level and individual-level 
costs), the survey designer can compare the numeric values 
of the variances to choose the best design. Section 4 
illustrates this approach and shows that each of the designs 
may be the best one, depending on the parameter values. 
The intuitive results (e.g., the higher cost of data collection 
and lower autocorrelations of the observed characteristics 
make panel modes of data collection less appealing) are 
given an analytic justification and quantitative backing. 

While Sections 2-4 deal with the efficiency in estimating 
the difference in means only, more realistic goals of data 
collection efforts would include contemporaneous characteristics 
and their variances. To this end, Section 5 
introduces a utility maximization framework describing the 
survey designer’s choice of the sampling scheme. This 
framework provides an aggregated objective function that 
combines several design criteria. The results are again as 
expected: if the more expensive panel modes of data 
collection result in smaller sample sizes, the estimates of the 
means are less efficient than in simpler designs. The only 
way to justify these efficiency losses is by a drastic 
improvement in the estimation of the difference that can 
only occur with higher autocorrelations. Such effects are 
also illustrated in Section 5. Section 7 concludes. Proofs are 
given in the Appendix. 


2. Design variances 


Let the population consist of $N$ clusters, or PSUs, in 
both time periods, and each cluster consist of $M$ individuals, 
or SSUs. Out of these, an SRS of $1 < n_t < N$ 
clusters is taken at time $t = 1, 2$, and an SRS of 
$1 < m_t < M$ individuals is taken in each cluster that is 
present in the sample at time $t$. Let the index $i$ denote 
PSUs, and the index $j$, SSUs. Thus the typical measurement 
will be denoted as $Y_{ijt}$ in the population, and $y_{ijt}$ in 
the sample. The population totals $T[\cdot]$ and their estimates 
$t[\cdot]$ can then be found as follows: 


cluster total: 

$$T_i[Y_t] = \sum_{j=1}^{M} Y_{ijt}, \qquad t_i[y_t] = \frac{M}{m_t} \sum_{j=1}^{m_t} y_{ijt};$$

population total: 

The variance of $Y$ and its within- and between-cluster 
components are 

(2.3) 

(2.5) 


The characteristic of primary interest is the change in the 
means, 

$$\bar{Y}_2 - \bar{Y}_1, \qquad (2.6)$$

estimated by 

$$d = \bar{y}_2 - \bar{y}_1. \qquad (2.7)$$


An attractive property of this estimator for analysts and data 
users is its internal consistency: the estimator of the 
difference is the difference of the estimators. If the samples 
in consecutive periods overlap only partially, then compos- 
ite or GLS estimators (Fuller 1999, Hansen, Hurwitz and 
Madow 1953, Patterson 1950, Rao and Graham 1964, 
Wolter 2007) have better efficiency. 

In what follows, we assume all sampling procedures to 
be simple random sampling without replacement. For the 
contemporaneous mean, the variance is given by (Cochran 

(2.8) 

For simplicity and clarity of exposition, we shall often 
assume symmetric conditions: 

Analytic derivations are possible without these assumptions, 
but become extremely cumbersome. Besides, it is unrealistic 
to think that the survey designer could know the charac- 
teristics of the future population. Thus (2.9) should be 
viewed as a reasonable working model. 


2.1 Independent design 


Proposition 1. Let $n_1$ out of $N$ clusters and $m_1$ out of $M$ 
observation units in selected clusters be taken without 
replacement at time $t = 1$. Let $n_2$ out of $N$ clusters and 
$m_2$ out of $M$ observation units in selected clusters be taken 
without replacement at time $t = 2$, with sampling performed 
independently from that at time $t = 1$. Then 

(2.10) 

The result follows immediately from (2.8) by independence 
of the two samples. The subindex $i$ of the variance 
stands for the “independent design”. Under the symmetric 
conditions of (2.9), if the sample sizes are the same in two 
periods, $n_1 = n_2 = n$ and $m_1 = m_2 = m$, then 

(2.11) 

where the subindex $e,i$ stands for “equal variances, 
independent design”. 


2.2 Cluster-panel design 


Proposition 2. Let $n$ out of $N$ clusters be sampled without 
replacement in the first period and be used in both time 
periods. Let $m$ out of $M$ observation units be sampled 
without replacement independently in two periods. Then 

Here, subindex $c$ stands for the “cluster-panel design”, 
and $\rho^{I}$ is the intertemporal correlation, or autocorrelation, 
of the cluster means. The superscript I denotes the first 
stage of sampling. If $\rho^{I}$ is positive, then the cluster-panel 
design is more efficient than the independent design for 
fixed values of $n$ and $m$. Under the symmetry conditions, 

(2.13) 

where the subindex $e,c$ stands for the “equal variances, 
cluster-panel design”. 


2.3 Observation-panel design 


Proposition 3. Let $n$ out of $N$ clusters and $m$ out of $M$ 
observation units be sampled without replacement in the 
first period and be used in both time periods. Then 

Subindex $o$ stands for the “observation-panel design”. 
Under the assumption of symmetric conditions, 

(2.15) 

with corresponding $e,o$ subindex for the “equal variances, 
observation-panel design”. 

Here, $\rho^{II}$ is the intertemporal correlation, or autocorrelation, 
of the individual observations within clusters. 
The superscript II stands for the second stage of sampling. 
If $\rho^{II}$ is positive, then the observation-panel design is more 
efficient than the cluster-panel design for fixed values of $n$ 
and $m$. 
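
For reference, the three symmetric-case variances discussed in this section can be set side by side. The display below is a sketch in assumed notation and is not a transcription of (2.11), (2.13) and (2.15): $S_b^2$ and $S_w^2$ are placeholder symbols for the between-cluster and within-cluster variance components, and the finite population corrections follow the textbook two-stage formula for simple random sampling without replacement.

```latex
\begin{align*}
V_{e,i}[d] &= 2\left[\left(1-\frac{n}{N}\right)\frac{S_b^2}{n}
            + \left(1-\frac{m}{M}\right)\frac{S_w^2}{nm}\right], \\
V_{e,c}[d] &= 2\left[(1-\rho^{I})\left(1-\frac{n}{N}\right)\frac{S_b^2}{n}
            + \left(1-\frac{m}{M}\right)\frac{S_w^2}{nm}\right], \\
V_{e,o}[d] &= 2\left[(1-\rho^{I})\left(1-\frac{n}{N}\right)\frac{S_b^2}{n}
            + (1-\rho^{II})\left(1-\frac{m}{M}\right)\frac{S_w^2}{nm}\right].
\end{align*}
```

Written this way, the ordering statements are immediate for fixed $n$ and $m$: a positive $\rho^{I}$ shrinks the first term relative to the independent design, and a positive $\rho^{II}$ additionally shrinks the second term.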

How are the two autocorrelations that appear in (2.15) 
related? Conceptually, one can think of any number of 
possible relations between them. Let us introduce a 
superpopulation model 

$$Y_{ijt} = \mu_t + a_{it} + \varepsilon_{ijt}, \quad E_\xi[a_{it}] = 0, \quad E_\xi[\varepsilon_{ijt}] = 0, \qquad (2.16)$$

in which $a_{is}$ and $\varepsilon_{ijt}$ are independent of one another for all 
$s, t = 1, 2$. The subindex $\xi$ stands for the superpopulation 
model expectations. The case of $\rho^{I} = 0$ and $\rho^{II} = 1$ occurs 
when the changes in the cluster means occur independently 
between clusters ($E_\xi[a_{i1} a_{i2}] = 0$), but the individuals retain 
their positions within the cluster, $\varepsilon_{ij1} = \varepsilon_{ij2}$. The case of 
$\rho^{I} = 1$ and $\rho^{II} = 0$ occurs when the cluster random effects 
are the same in both periods, $a_{i1} = a_{i2}$, while the individual 
random effects are uncorrelated ($E_\xi[\varepsilon_{ij1} \varepsilon_{ij2}] = 0$). Neither 
of these situations is entirely realistic. However, it can 
probably be expected that the individual, rather than the 
cluster, dynamics are a more important source of varia- 
tion over time, thus making the relations p' > p' > 0 
the most plausible ones. We shall study in numeric 
examples of Sections 4 and 5 the extent to which the 
choice of the best design is sensitive to the relation 
between the two correlations. 
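
The superpopulation model can be explored numerically. The sketch below simulates two waves from a model of the form $Y_{ijt} = \mu_t + a_{it} + \varepsilon_{ijt}$ and estimates the two autocorrelations empirically. The normal distributions, the default standard deviations and means, and all function names are assumptions made for illustration; none of them come from the paper.

```python
import math
import random


def simulate_two_wave_population(n_clusters, cluster_size, rho_cluster, rho_indiv,
                                 sd_cluster=1.0, sd_indiv=2.0, mu=(0.0, 0.5), seed=0):
    """Return population[i][j] = (y_ij1, y_ij2) with y_ijt = mu_t + a_it + eps_ijt.

    Cluster effects (a_i1, a_i2) have correlation rho_cluster and individual
    effects (eps_ij1, eps_ij2) have correlation rho_indiv (illustrative normal model).
    """
    rng = random.Random(seed)

    def correlated_pair(sd, rho):
        # Two normal draws with common standard deviation sd and correlation rho.
        z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        return sd * z1, sd * (rho * z1 + math.sqrt(1.0 - rho * rho) * z2)

    population = []
    for _ in range(n_clusters):
        a1, a2 = correlated_pair(sd_cluster, rho_cluster)
        cluster = []
        for _ in range(cluster_size):
            e1, e2 = correlated_pair(sd_indiv, rho_indiv)
            cluster.append((mu[0] + a1 + e1, mu[1] + a2 + e2))
        population.append(cluster)
    return population


def correlation(xs, ys):
    """Pearson correlation of two equally long sequences."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def empirical_autocorrelations(population):
    """Return (correlation of cluster means, correlation of within-cluster deviations)."""
    means1, means2, dev1, dev2 = [], [], [], []
    for cluster in population:
        m1 = sum(y1 for y1, _ in cluster) / len(cluster)
        m2 = sum(y2 for _, y2 in cluster) / len(cluster)
        means1.append(m1)
        means2.append(m2)
        dev1.extend(y1 - m1 for y1, _ in cluster)
        dev2.extend(y2 - m2 for _, y2 in cluster)
    return correlation(means1, means2), correlation(dev1, dev2)


if __name__ == "__main__":
    pop = simulate_two_wave_population(1500, 40, rho_cluster=0.6, rho_indiv=0.3, seed=123)
    print(empirical_autocorrelations(pop))
```

With finite clusters, the correlation of the cluster means blends both sources: under this sketch it is $(\rho_a \sigma_a^2 + \rho_\varepsilon \sigma_\varepsilon^2/M)/(\sigma_a^2 + \sigma_\varepsilon^2/M)$, where $\rho_a$ and $\rho_\varepsilon$ are the correlations of the cluster and individual effects, so it approaches $\rho_a$ only as the cluster size $M$ grows.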


3. Costs for repeated cluster samples 


In this section we shall analyze the cost efficiency of 
cluster samples when one wants to estimate the difference 
between two sample means from two different periods. 

Some discussion of the costs of cluster sampling is given 
in Kish (1995, Section 8.3B), Thompson (1992, Section 
12.5), and Lehtonen and Pahkinen (2004). More mathematical 
details are available in Hansen et al. (1953, volume II, 
Section 6.11), with the variance formulas corrected for finite 
populations. 

3.1 Notation and cost models 


Let us assume the following cost structure, which is an 
extension of Kish (1995) for repeated surveys: 


- $c_1^{I}$ is the cluster level cost at time $t = 1$ for clusters that 
are used in the first wave only; 

- $c_2^{I}$ is the cluster level cost for a new cluster at time 
$t = 2$; 

- $c_{12}^{I}$ is the cluster level cost for clusters in which the data 
are collected in both periods $t = 1$ and $t = 2$ (PSU 
panel cost); 

- $c_1^{II}$ is the individual level cost at time $t = 1$ for 
individuals that are observed in the first wave only; 

- $c_2^{II}$ is the individual level cost at time $t = 2$ for 
individuals that are observed in the second wave only; 

- $c_{12}^{II}$ is the individual level cost if the unit is observed in 
both periods in the observation-panel design (SSU 
panel cost); 

- $C_0$ is the total budget allocated to the field work in both 
time periods. 


Roman superscripts denote the sampling stage. Arabic 
subscripts correspond to the occasion at which the sample is 
taken. The cluster level costs include the cost of sampling 
the clusters, obtaining the PSU maps, collecting community 
data, local interviewer training, etc. The individual level 
costs are mostly those of the personal interviews with the 
ultimate observation units. The total cost $C_0$ is thought of as 
the variable cost of the survey that is directly related to the 
number of sampled units. Fixed costs, such as the cost of 
preparing the survey instrument and other organization-level 
costs, are not part of $C_0$. 


3.2 Independent design 


The budget constraint for the independent design is given 
by 

$$C_0 = c_1^{I} n_1 + c_1^{II} n_1 m_1 + c_2^{I} n_2 + c_2^{II} n_2 m_2. \qquad (3.1)$$

The first two terms are the costs of the first wave of data 
collection, and the last two terms, of the second wave. 


Proposition 4. If the survey setting parameters are the same 
in the two time periods: 


(3.2) 


then the optimal sample sizes and the resulting variances 
are given by 

In equations (3.3), the sample sizes $n$ and $m$ are treated 
as continuous variables. In practice, the nearest integer 
should be used, with a minimum of 2 necessary to estimate 
the appropriate variance component, and the maxima of $N$ 
and $M$, respectively. 

The number of observations sampled within a cluster 
depends only on the relative costs at the cluster and the 
observation level, $c^{I}/c^{II}$, and on the ratio of the variance 
components or, equivalently, the intraclass correlation. Greater interview 
cost $c^{II}$ prevents the sample designer from using more 
observations: an increase in $c^{II}$ leads to a decrease in both 
$m$ and $n$. Greater cluster-level cost leads to redistribution 
of the sampled units: $n$ decreases with $c^{I}$, while $m$ 
increases with it. Greater within-cluster variance necessitates 
a greater number of observations $m$ to be taken within 
a cluster to maintain overall precision. Greater 
between-cluster variance necessitates a greater number of clusters 
$n$ to be sampled. Finally, the total survey budget $C_0$ affects 
the number of clusters $n$, but not the subsample size $m$. As 
a result, the variance of $d$ is inversely proportional to $C_0$. 

The non-symmetric situation can be treated as a by- 
product of the first order conditions derived in the proof (see 
Appendix). However, no analytic solution is available in 
that case. 
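
The comparative statics above can be traced through a textbook optimum for two-stage sampling under a linear cost function. The sketch below is not a transcription of (3.3): it assumes the symmetric setting, placeholder symbols $S_b^2$ and $S_w^2$ for the between-cluster and within-cluster variance components, the shorthand $A = S_b^2 - S_w^2/M > 0$, and a budget of $C_0/2$ per wave.

```latex
\begin{align*}
V_{e,i}[d] &= 2\left[\frac{A}{n} + \frac{S_w^2}{nm}\right] - \frac{2 S_b^2}{N},
\qquad \frac{C_0}{2} = n\left(c^{I} + c^{II} m\right), \\
\frac{A}{n} + \frac{S_w^2}{nm}
  &= \frac{2}{C_0}\left(A + \frac{S_w^2}{m}\right)\left(c^{I} + c^{II} m\right), \\
m^{*} &= \sqrt{\frac{c^{I}}{c^{II}}\,\frac{S_w^2}{A}}, \qquad
n^{*} = \frac{C_0}{2\left(c^{I} + c^{II} m^{*}\right)}, \\
V_{e,i}[d]\,\Big|_{n^{*},\,m^{*}}
  &= \frac{4}{C_0}\left(\sqrt{c^{I} A} + \sqrt{c^{II} S_w^2}\right)^{2} - \frac{2 S_b^2}{N}.
\end{align*}
```

In this sketch $m^{*}$ is free of $C_0$ while $n^{*}$ is proportional to it, and apart from the finite population term the variance scales as $1/C_0$, which is the pattern described in the text.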


3.3 Cluster-panel design 


The budget constraint for the cluster-panel design is 
given by 

$$C_0 = c_{12}^{I} n + c_1^{II} n m_1 + c_2^{II} n m_2. \qquad (3.4)$$

The first term is the cluster-level cost associated with the 
sample design, and the remaining two terms are the costs of 
collecting individual-level data in the first and the second 
waves, respectively. 


Proposition 5. The sample sizes for the cluster-panel design 
are given by 

The variance of the difference estimator is found by 
plugging these expressions into (2.13). Under the assumptions 
of symmetric conditions in two rounds of the survey 
(2.9) and (3.2), 

and $V_{e,c}[d]$ can be found from (2.13). 

Interestingly, the number of the SSUs depends on the 
SSU costs $c^{II}$, but not on the PSU costs $c_{12}^{I}$. An increase in 
the intracluster correlation, that is, an increase in the 
between-cluster variance or a decrease in the within-cluster 
variance, predictably leads to a decrease in the optimal number of 
SSUs and an increase in the optimal number of PSUs. The 
dependence of the design parameters on the survey budget 
$C_0$ is non-trivial. For very small surveys, the number of 
units per cluster is proportional to $C_0$, and the number of 
clusters is not affected by $C_0$. Indeed, if the characteristic 
demonstrates strong correlation between time periods, it 
would be preferable to get accurate estimates of the cluster 
means, and good accuracy of the overall difference estimator 
will follow. To put it differently, the first term in 
(2.13) is relatively small by virtue of the positive correlation 
coefficient $\rho^{I}$, and the second term is inversely proportional 
to $C_0$. For large surveys, $D \propto C_0$, so both the number of 
units per cluster and the number of clusters are proportional 
to $C_0^{1/2}$. The first term in (2.13) is then inversely proportional 
to $C_0^{1/2}$, and the second term is inversely proportional 
to $C_0$. An increase in the budget of the survey will 
affect all terms, although to a different extent. 


3.4 Observation-panel design 


The budget constraint for the observation-panel design is 
given by 

$$C_0 = c_{12}^{I} n + c_{12}^{II} n m. \qquad (3.6)$$

The first term is the cluster-level cost, and the second term 
is the cost of individual interviews. 

Proposition 6. The optimal sample sizes for the 
observation-panel design are given by 

The design variance of the resulting difference estimator is 

(3.7) 

The sample size expressions (3.7) resemble the ones for 
the independent design, equation (3.3), with the cost of data 
collection in a single wave replaced by the cost of panel data 
collection, and the between-cluster and within-cluster variance 
components multiplied by $(1 - \rho^{I})$ and $(1 - \rho^{II})$, respectively. 
The second stage sample size $m$ depends only on the relative cost at the 
cluster and observation levels, and on the ratio of the 
variance components adjusted by the autocorrelations. 
Hence, as in the independent design, the sample size depends 
on the scale of the survey only through 
$n \propto C_0$, and the variance of the difference is 
inversely proportional to $C_0$. 
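
The substitution described above can be sketched in code. The functions below implement a textbook optimal allocation for two-stage sampling with a linear cost function and reuse it for the independent and observation-panel designs; they are not a transcription of (3.3) or (3.7). All function names are hypothetical, and the parameter values in the usage lines (standard deviations of 100 and 400, N = 10,000, M = 1,000, and the two autocorrelations) are assumptions chosen for illustration.

```python
import math


def optimal_allocation(between_var, within_var, cost_cluster, cost_unit,
                       budget, cluster_size):
    """Continuous (n, m) minimizing a/n + within_var/(n*m),
    where a = between_var - within_var/cluster_size,
    subject to budget = n * (cost_cluster + cost_unit * m)."""
    a = between_var - within_var / cluster_size
    if a <= 0:
        raise ValueError("no interior optimum: between-cluster component too small")
    m = math.sqrt(cost_cluster * within_var / (cost_unit * a))
    n = budget / (cost_cluster + cost_unit * m)
    return n, m


def difference_variance(between_var, within_var, n, m, n_clusters, cluster_size):
    """Variance of d under symmetric conditions: twice the two-stage SRS variance."""
    return 2.0 * ((1.0 - n / n_clusters) * between_var / n
                  + (1.0 - m / cluster_size) * within_var / (n * m))


def independent_design(sb2, sw2, cost_cluster, cost_unit, budget, n_clusters, cluster_size):
    # Each wave receives half of the budget and pays single-wave costs.
    n, m = optimal_allocation(sb2, sw2, cost_cluster, cost_unit, budget / 2.0, cluster_size)
    return n, m, difference_variance(sb2, sw2, n, m, n_clusters, cluster_size)


def observation_panel_design(sb2, sw2, rho_cluster, rho_indiv,
                             panel_cost_cluster, panel_cost_unit,
                             budget, n_clusters, cluster_size):
    # Variance components shrink by (1 - rho); panel costs cover both waves.
    eb2 = (1.0 - rho_cluster) * sb2
    ew2 = (1.0 - rho_indiv) * sw2
    n, m = optimal_allocation(eb2, ew2, panel_cost_cluster, panel_cost_unit,
                              budget, cluster_size)
    return n, m, difference_variance(eb2, ew2, n, m, n_clusters, cluster_size)


if __name__ == "__main__":
    SB2, SW2 = 100.0 ** 2, 400.0 ** 2      # assumed variance components
    N_CLUSTERS, CLUSTER_SIZE = 10000, 1000  # assumed population sizes
    print(independent_design(SB2, SW2, 10.0, 1.0, 20000.0, N_CLUSTERS, CLUSTER_SIZE))
    # The autocorrelations 0.2 and 0.4 are placeholders, not values from the paper.
    print(observation_panel_design(SB2, SW2, 0.2, 0.4, 18.0, 3.0, 20000.0,
                                   N_CLUSTERS, CLUSTER_SIZE))
```

Under these assumed values the independent design comes out at a variance of about 99.87, which is close to, though not a verified reproduction of, the independent-design value reported in Section 4. Doubling the budget leaves `m` unchanged and doubles `n`, matching the scale behaviour described above.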

Extending the relations between the functional forms of 
equations (3.3) and (3.8), we can establish the general 
relations between the two designs: 


Proposition 7. If $M \gg 1$ and $N \gg 1$, then $V_{e,i}[d] \ge V_{e,o}[d]$ 
if 


2( fel? + fo"S? ) 
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Unfortunately, the variance for the cluster-panel design 
that can be obtained by combining the results of Proposition 
5 with (2.13), does not permit an equally lucid comparison. 
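
The comparison between the independent and observation-panel designs can be motivated as follows. The sketch assumes textbook optima for both designs, placeholder symbols $S_b^2$ and $S_w^2$ for the between-cluster and within-cluster variance components, single-wave costs $c^{I}$ and $c^{II}$, and it drops all terms of order $1/N$ and $1/M$; it is offered as an illustration and not as the statement of the proposition.

```latex
\begin{align*}
V_{e,i}[d] &\approx \frac{4}{C_0}
  \left(\sqrt{c^{I} S_b^2} + \sqrt{c^{II} S_w^2}\right)^{2}, \\
V_{e,o}[d] &\approx \frac{2}{C_0}
  \left(\sqrt{c_{12}^{I}(1-\rho^{I}) S_b^2}
      + \sqrt{c_{12}^{II}(1-\rho^{II}) S_w^2}\right)^{2}, \\
V_{e,i}[d] \ge V_{e,o}[d] &\iff
  \sqrt{2}\left(\sqrt{c^{I} S_b^2} + \sqrt{c^{II} S_w^2}\right) \ge
  \sqrt{c_{12}^{I}(1-\rho^{I}) S_b^2} + \sqrt{c_{12}^{II}(1-\rho^{II}) S_w^2}.
\end{align*}
```

The condition makes the trade-off explicit: higher panel costs raise the right-hand side, while higher autocorrelations lower it.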


4. Numeric illustration 


To illustrate how the characteristics of population 
(variances and autocorrelations) and the data collection 
process (costs) affect the choice of the most efficient design, 
we consider a numeric example. Let us choose the basic 
setup with symmetric conditions, and let the parameter 
values be: 


N= 10,000) 000, 5 = 100, 
S = A00.. Qos Oul st Den ects 


lates ate NG at ad 
C= 18, C= 20,000. (4.1) 


The cost structure implies that the cost of collecting the 
initial information for a cluster is the cost of ten interviews, 
while the cost of the followup in the same cluster is only 
eight interviews. On the other hand, getting the second 
interview with the same unit is twice as expensive as getting 
the first interview. 

With these parameters, the sample sizes and design 
variances are: 


Mae V2s 7, care V2) Mes 08s 

$$V_{e,i}[d] = 99.86, \quad V_{e,c}[d] = 91.37, \quad V_{e,o}[d] = 90.20. \qquad (4.2)$$
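
One way to compare the designs is through ratios of the variances in (4.2). The snippet below assumes that relative efficiency is measured as the ratio of two design variances minus one; that definition and the function name are assumptions made for illustration. It works from the rounded values in (4.2), so the last digit can differ from figures quoted in the text.

```python
def relative_gain(reference_variance, design_variance):
    """Percentage by which reference_variance exceeds design_variance."""
    return 100.0 * (reference_variance / design_variance - 1.0)


# Rounded design variances as reported in (4.2).
v_independent, v_cluster_panel, v_observation_panel = 99.86, 91.37, 90.20

gain_vs_cluster_panel = relative_gain(v_cluster_panel, v_observation_panel)
gain_vs_independent = relative_gain(v_independent, v_observation_panel)
print(f"{gain_vs_cluster_panel:.2f}% {gain_vs_independent:.2f}%")
```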


The observation-panel design is 1.2% more efficient than 
the cluster-panel design, and 10.7% more efficient than the 
independent design. However, it has a notably smaller total 
sample size, only 2/3 of the cluster-panel design sample 
size and 70% of the independent design sample size. 

Of course these findings are highly specific to the 
parameters of the population and the cost structure. Can we 
describe general patterns of how the variances, and hence 
the relative efficiency of different designs, change with 
those parameters? The variances in (4.2) are derived from 
13 parameters given in (4.1), and it is difficult to make 
meaningful statements about all of these parameters 
simultaneously. Below, we shall attempt to provide two-dimensional 
cross-sections of this 13-dimensional space and give 
graphical illustrations of the variability of the design 
variances, and hence the domains of optimality of each 
design, as we vary two parameters at a time. We provide the 
graphs of variances of the designs involved (typically, the 
cluster-panel design with dotted lines, the observation-panel 
design with dashed lines, and the independent design with 
dash-dotted lines; for most plots, the independent design is 
not affected by the variations of the parameters that make up 
the axes of the plots, and is hence omitted). We also show the 
relative efficiency of different designs, marking the domains 
of the parameter space in yellow/light gray if the independent 
design is the most efficient one; in green/medium 
gray if the cluster-panel design is the most efficient one; and 
in purple/dark gray if the observation-panel design is the 
most efficient one (R code used to produce graphs is available 
at http://web.missouri.edu/~kolenikovs/SMJ2011/). 

Figure 1 shows how the design variances, and hence the 
most efficient design, vary with the panel costs of the PSU 
and SSU, $c_{12}^{I}$ and $c_{12}^{II}$. Obviously, these variations do not 
affect the variance of the independent design, which serves 
as a benchmark. Also, the variations in $c_{12}^{II}$ do not affect the 
performance of the cluster-panel design, which corresponds 
to the dotted vertical iso-variance lines on the left panel. The 
dashed downward sloping lines are the iso-variance lines for 
the observation-panel design. Note that the lower left corner 
of the graph corresponds to the free lunch situation in which 
the second wave of data collection does not cost anything: 
the panel costs are equal to the single period cost, $c_{12}^{I} = c_1^{I}$, 
$c_{12}^{II} = c_1^{II}$. When the costs of the panel data collection are 
prohibitively high (the upper right corner of the graph), the 
independent design is the most efficient one. The point 
where all three designs have the same variances is $c_{12}^{I} = 22$, 
$c_{12}^{II} = 3.05$; i.e., the second interview costs 2.05 times 
as much as the first interview, and the cluster-level 
costs in the second wave are 20% higher than in the 
first wave. Still, a positive autocorrelation justifies the 
reduction in the sample size of the observation-panel design 
as compared to the independent design. If the cluster level 
panel cost is lower and the second interview cost is higher, 
the cluster-panel design is the most efficient. For 
inexpensive second interviews, the most efficient design is 
the observation-panel design. The latter domain includes our 
baseline case with $c_{12}^{I} = 18$ and $c_{12}^{II} = 3$. 

Figure 1 Design variances as functions of the data collection costs $c_{12}^{I}$, $c_{12}^{II}$. Left: contour lines of $V_{e,c}[d]$ (dotted) and $V_{e,o}[d]$ 
(long dashed); $V_{e,i}[d] = 99.86$; right: domains of optimality of the three designs. Axis label: SSU panel cost. 

Figure 2 shows the changes in design variances associated 
with the changes in the autocorrelations $\rho^{I}$, $\rho^{II}$. The 
independent design variance is unaffected by these variations, 
and the cluster-panel design is unaffected by variations 
in $\rho^{II}$. The observation-panel design is more efficient 
for higher SSU autocorrelation, $\rho^{II} > 0.34$. Otherwise, the 
cluster-panel design provides lower variance. 

Figure 3 investigates the impact of the cluster-level cost 
and autocorrelation on the choice of the design. The combination 
of an expensive second wave of data collection and low 
PSU autocorrelation in the upper left corner of the plot 
makes the independent design the most appealing one. 
Otherwise, the observation-panel design is the best one to 
use. Note that the contour lines for the cluster-panel and 
observation-panel designs are very close to one another, and 
differences in variances between the two designs are less 
than 2% in the whole parameter space of this plot. 

Figure 4 investigates the impact of the observation-level 
cost and autocorrelation on the choice of the design. Neither 
the independent design nor the cluster-panel design vari- 
ances are affected by variation of the parameters shown on 
this plot. The independent design variance is 99.86, while the 
cluster-panel design variance is 91.37, so the 
observation-panel design is compared to the latter only. High 
autocorrelations ($\rho^{II} > 0.6$) can justify a very high cost of the 
second interview (up to fourfold compared to the first 
interview), but in the upper left corner of the plot corre- 
sponding to the low autocorrelations and high panel cost, 
the cluster-panel design performs better. 

Figure 5 relates the design variances to the cluster-level 
costs of the survey. The horizontal axis is the cost in the first 
period, $c_1^{I}$, and the vertical axis is the additional cost in 
the second period when the data are collected in a panel 
mode, $c_{12}^{I} - c_1^{I}$. The vertical axis is ignored for the 
independent design, as this parameter does not appear in the 
independent design. Also, by virtue of (4.1), $c_2^{I} = c_1^{I}$. The 
observation-panel design is uniformly better than the 
cluster-panel design for all parameter combinations on this 
graph, although the difference in variances does not exceed 
2%. In the upper left corner, the additional cost of the panel 
mode of data collection is prohibitively high, and the 
independent design offers better performance. 

Figure 6 shows the dependence of the most efficient 
design on the total budget of the survey and the cost of panel 
mode of data collection at the cluster level. For total budgets above 10,000,
the observation-panel design performs better
if the additional cost of the panel mode of data
collection at the cluster level does not exceed 127% of the 
initial cluster-level cost in the first wave. Interestingly, for 
some isolated parameter configurations in small surveys, the 
cluster-panel design can perform better than the observation- 
panel design that dominates the rest of the plot. The differ- 
ence in design variances between the cluster-panel and 
observation-panel designs is less than 4% across all para- 
meter combinations on this graph. 

Figure 2 Design variances as functions of the population correlations $\rho^{I}$ and $\rho^{II}$. Left: contour lines of $V_{\text{c-p}}[d]$ (dotted) and $V_{\text{o-p}}[d]$
(long dashed); $V_{\text{ind}}[d] = 99.86$; right: ratio $V_{\text{o-p}}[d] / V_{\text{c-p}}[d]$

Figure 3 Design variances as functions of the cluster-level autocorrelation $\rho^{I}$ and cost $c_{12}^{I}$. Left: contour lines of $V_{\text{c-p}}[d]$ (dotted)

Figure 4 Design variances as functions of the observation-level autocorrelation $\rho^{II}$ and cost $c_{12}^{II}$. Left: contour lines of $V_{\text{o-p}}[d]$
(long dashed); $V_{\text{ind}}[d] = 99.86$; $V_{\text{c-p}}[d] = 91.37$; right: ratio $V_{\text{o-p}}[d] / V_{\text{c-p}}[d]$

Figure 5 Design variances as functions of the cluster level costs in the first wave, $c_1^{I}$, and in the second wave, $c_{12}^{I} - c_1^{I}$. Left: contour lines

Figure 6 Design variances as functions of the total budget and the PSU panel cost $c_{12}^{I}$. Left: contour lines of $V_{\text{c-p}}[d]$ (dotted),
$V_{\text{o-p}}[d]$ (long dashed) and $V_{\text{ind}}[d]$ (dash-dotted); right: domains of optimality of the three designs


Overall, this numeric illustration shows that depending 
on the parameters of the population and costs of data 
collection, each of the three designs can be the most effi- 
cient one. Low correlations and high costs in the second 
wave tend to favor the independent design. Given that the 
initial six population parameters and five cost parameters 
may not be representative of many repeated surveys, a 
sensitivity analysis like the one performed here may be 
needed for any particular survey a statistician needs to 
design. 


5. Survey design with multiple criteria 


So far, our analysis was confined to estimation of the dif- 
ference between the means in two waves of data collection 
of a single variable. Most large scale surveys are conducted to
study several characteristics, and to many users, the contem- 
poraneous estimates are also of interest. To accommodate 
accuracy requirements associated with these different vari- 
ables and different estimates, the survey designer must have 
several variances in mind when choosing the design to be 
implemented. This is a multicriterial optimization problem, 
and no single design will work best for all possible esti- 
mation problems. In the current context, the observation- 
panel design may give good estimates of the change when 
both PSU and SSU autocorrelations are high, but it may 
result in a small sample size if both PSUs and SSUs are 
expensive to follow up. Greater precision of the estimates 
for any single period could be obtained by switching to 
the cluster-panel or even independent designs. 

Comparing different designs in this situation is possible
with the standard microeconomic argument of utility maxi- 
mization under budget constraints (Mas-Colell, Whinston 
and Green 1995). In the survey design context, the utility of 
the survey designer increases with the precision of the 
survey estimates, or equivalently decreases with survey variances.
A simple functional form is given by the Cobb-Douglas
utility function:

$$U(\text{design}) = V_{\text{design}}[\bar y_1]^{-\alpha_1} \, V_{\text{design}}[\bar y_2]^{-\alpha_2} \, V_{\text{design}}[d]^{-\alpha_3}. \qquad (5.1)$$

Here, $\alpha_1$, $\alpha_2$ and $\alpha_3$ are positive constants describing the
relative weights of the three design variances in the decision-making
process. Variances $V[\bar y_1]$ and $V[\bar y_2]$ in (5.1) are
the variances of the means in cluster surveys given by (2.8). 
The variance of the difference estimator is (2.10), (2.12) or 
(2.14), depending on the design. The survey designer prob- 
lem is then to maximize (5.1) subject to design-specific 
budget constraints (3.1), (3.4) or (3.6). Maximization is 
performed over the design parameters (mode of data collec- 
tion, number of clusters in each time period, number of 
observations in each time period), given the characteristics of 
population (variances and autocorrelations) and the data 
collection process (costs). 

Let us assume that the precision of each of the three
estimates $\bar y_1$, $\bar y_2$ and $d$ is equally important to the decision
maker, so $\alpha_1 = \alpha_2 = \alpha_3$. To have an objective function
that is measured in the variance units and is on the same
scale as variances, it will be convenient to define a multicriterial
variance

$$V_{\text{design}} = \left( V_{\text{design}}[\bar y_1] \, V_{\text{design}}[\bar y_2] \, V_{\text{design}}[d] \right)^{1/3} \qquad (5.2)$$


and express the optimization problem as minimization of 
this expression. 

Analytic characterization of the design that optimizes 
(5.2) becomes quite cumbersome. Instead, we utilize the
numeric illustration of the previous section to demonstrate
how accounting for other design objectives affects the choice
of the design. We should expect that for the designs with
more expensive follow-ups ($c_{12}^{I} \ge c_1^{I} + c_2^{I}$, $c_{12}^{II} > c_1^{II} + c_2^{II}$),
the simpler designs would be selected more often: the 
cluster-panel design may be preferred to the observation- 
panel design, and the independent design may be preferred 
to the cluster-panel design. For the baseline settings (4.1), 
we have 


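$$V_{\text{ind}}[\bar y] = 49.93, \quad V_{\text{c-p}}[\bar y] = 47.68, \quad V_{\text{o-p}}[\bar y] = 61.69,$$

$$V_{\text{ind}} = 62.91, \quad V_{\text{c-p}} = 59.23, \quad V_{\text{o-p}} = 70.09,$$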


where the time indices of $\bar y_t$ are omitted. The observation-panel
design is rather inefficient in estimating the period-
specific means as this design samples fewer units. Instead, 
the cluster-panel design is the most efficient one, closely 
followed by the independent design. 
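
As a rough numeric check of this comparison, the following Python sketch assumes that the multicriterial variance is the weighted geometric mean of the three design variances, which is the form the Cobb-Douglas utility takes once it is rescaled to variance units. The function name and the dictionary layout are illustrative only. The inputs are the baseline variances quoted in the text: 99.86 and 91.37 for the difference in Section 4, and 49.93 and 47.68 for the period means. The observation-panel design is left out because its baseline variance of the difference is not quoted.

```python
import math

def multicriterial_variance(v_mean_1, v_mean_2, v_diff, weights=(1.0, 1.0, 1.0)):
    """Weighted geometric mean of three design variances (assumed form of (5.2))."""
    a1, a2, a3 = weights
    total = a1 + a2 + a3
    log_v = a1 * math.log(v_mean_1) + a2 * math.log(v_mean_2) + a3 * math.log(v_diff)
    return math.exp(log_v / total)

# Baseline variances quoted in the text: (V[mean 1], V[mean 2], V[d]).
# Equal variances in both periods are assumed (symmetric conditions).
baseline = {
    "independent": (49.93, 49.93, 99.86),
    "cluster-panel": (47.68, 47.68, 91.37),
}

results = {name: multicriterial_variance(*v) for name, v in baseline.items()}
best = min(results, key=results.get)  # design with the smallest objective
```

With equal weights this gives roughly 62.9 for the independent design and 59.2 for the cluster-panel design, so the cluster-panel design is preferred among the two. Passing `weights=(0, 0, 1)` reduces the objective to the variance of the difference alone.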

Figures 7-12 parallel Figures 1-6, respectively. Since the
best design in terms of V is now the cluster-panel design, 
most of these plots show the preference toward this design. 
Figure 7 shows that when the variances of the contempora- 
neous means are taken into account, the simpler inde- 
pendent and cluster-panel designs are preferred for a greater 
fraction of parameter settings, and occupy a larger portion of 
the plot than in Figure 1. The point where the three designs
are equivalent is $c_{12}^{I} = 20.6$, $c_{12}^{II} = 2.27$, closer to the
origin than in Figure 1, in which only the variance of the 
difference was taken into account. 

Figure 8 shows that the observation-panel design is only 
justified when both autocorrelations are higher than 0.6 (for 
the given values of population variances and costs). Recall 
that in Figure 2, the observation-panel design was preferred 
whenever $\rho^{II} > 0.34$, with little dependence on $\rho^{I}$.

Figure 9 shows how the PSU-level correlations and costs 
affect the choice of the design. The observation-panel 
design is less efficient than the cluster-panel design for all 
combinations of parameters in this plot. Hence, the choice 
of the design is between the independent and the cluster- 
panel designs. Naturally, if the data collection in the panel 
mode is expensive, the independent design is preferred to 
the cluster-panel design. Interestingly, the preference towards
a particular design is not monotone in $\rho^{I}$. With values
$\rho^{I} > 0.7$, the $V[d]$ component in (5.2) produces designs
with so few clusters that $V[\bar y]$ suffers notably enough to
hurt the whole objective function. At that value of panel
autocorrelation, the maximum panel cost at which the
cluster-panel design is still the most efficient one is $c_{12}^{I} = 24.4$,
i.e., the cluster-level cost in the second wave is 44%
higher than in the first wave.

Figure 10 shows that the higher autocorrelation of the 
SSU measurements may justify modest extra cost associated 
with data collection. The highest cost for which the observation-panel
design is still the most efficient one is $c_{12}^{II} = 2.75$
with $\rho^{II} = 0.78$; i.e., the cost of the second interview
can be 75% more than the cost of the first interview.

Figure 10 Design variances as functions of the observation-level autocorrelation $\rho^{II}$ and cost $c_{12}^{II}$. Left: contour lines of $V_{\text{o-p}}$ (long

Figure 11 parallels Figure 5. The left panel shows that the
observation-panel design is less efficient than the cluster- 
panel design. The right panel shows that if the cluster-level 
cost of the second wave exceeds the cluster-level cost of the 
first wave by more than 15 units, the independent design 
delivers better efficiency than the cluster-panel design. 

Finally, Figure 12 shows the variances as functions of the 
total survey budget and the cost of the panel mode of data 
collection. There is very little dependence on the total budget in the plot,
and the independent design is preferred if the panel mode is
too expensive, namely, when the cluster-level cost in the
second wave exceeds 107% of that in the first wave.

As was conjectured at the beginning of this section,
incorporation of the variances of the contemporaneous 
means into the design optimization objective function shifted 
the preferences of the survey designer towards simpler 
designs that can sample a greater number of the ultimate 
observation units. The observation-panel design now only 
makes sense when both the PSU and SSU autocorrelations 
are high, and the panel costs are reasonably low. Moreover, 
the cluster-panel design is generally justified only if there is 
an economy in cluster-level cost in the second wave of the 
survey. 

Figure 11 Design variances as functions of the data collection costs. Left: contour lines of $V_{\text{c-p}}$ (dotted), $V_{\text{o-p}}$ (long dashed)

Figure 12 Design variances as functions of the total budget and the PSU panel cost $c_{12}^{I}$. Left: contour lines of $V_{\text{c-p}}$ (dotted),


6. Extensions to multiple waves


If the survey to be designed will have more than two 
waves of data collection, the survey designer may be able to 
extend the framework of the utility maximization problem 
(5.1), with the following considerations in mind. 


1. A greater number of targets of inference. Possible
variances that the survey designer may need to take
into account can now include: contemporaneous variances
$V[\bar y_1], V[\bar y_2], \ldots, V[\bar y_T]$; consecutive differences
$V[\bar y_t - \bar y_{t-1}]$, or composite
GLS estimators of the change between two adjacent
periods of time; other contrasts $V[\sum_t c_t \bar y_t]$, $\sum_t c_t = 0$;
variance of the linear growth rates from regression of
$\bar y_t$ on $t$, estimated by OLS or GLS; etc.

2. A possibility of discounting. In economics, it is cus- 
tomary to specify the budget constraints that look into 
the future in the form of $\sum_t x_t \delta^t$ where $x_t$ is the
amount spent in time $t$, and $\delta < 1$ is the discount
factor associated with interest rates. Discounting may 
also be relevant for the utility function, and design 
variances farther in the future may have lower 
weights in the optimization problem. 

3. Unknown functional forms of the time-series pro- 
cesses associated with the variable of interest. The 
survey designer needs to have a good idea about the 
covariance structure of the time series of both indi- 
vidual observations and cluster means. It is likely that 
the results will be sensitive to the choice of the 
particular model. In the current analysis, the issue is
ameliorated, as it suffices to have a single correlation 
parameter for each level. The survey designer may 
have to introduce more parameters into the model, 
and correspondingly study sensitivity of the design 
choice with respect to these parameters. 
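
The discounting idea in item 2 can be sketched as follows. The function evaluates a discounted sum of the form $\sum_t x_t \delta^t$; indexing the first wave as $t = 0$ and the example figures are assumptions made only for illustration.

```python
def discounted_total(amounts, delta):
    """Return sum_t amounts[t] * delta**t, with the first wave at t = 0 (assumed)."""
    if not 0 < delta <= 1:
        raise ValueError("delta is expected to lie in (0, 1]")
    return sum(x * delta ** t for t, x in enumerate(amounts))

# Hypothetical spending of 100 units in each of three waves, discount factor 0.9
three_waves = discounted_total([100.0, 100.0, 100.0], 0.9)
```

The same function could be applied to design variances rather than costs if variances farther in the future are to receive lower weights.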


The complexity of the problem, as outlined above, can grow 
out of control very quickly. We thus abstain from a more 
detailed treatment of it in this paper. 


7. Discussion 


This paper has analyzed different options for implementation
of repeated cluster surveys. We have provided
analytical expressions for design variances of the simple
difference estimator for three popular designs (the inde- 
pendent, the cluster-panel and the observation-panel de- 
signs). We have also derived the optimal sample sizes for 
estimation of the difference between two waves of data 
collection. 

The sample designer who knows that the characteristic of
interest is going to have some degree of persistence over 
time will likely choose one of the panel designs, provided 
that the costs of re-visiting the clusters and/or observation 
units are not prohibitively high. Analytical comparison is
possible between the independent and the observation-panel 
designs, and is given by Proposition 7. It is worth noting that 
the design variance of the difference is O(C,') for both the 
independent design and the observation-panel design, and is 
OCF ') for the cluster-panel design, where C, is the total 
budget of the survey. Hence the cluster-panel design is only 
viable for smaller surveys, while the large scale surveys will 
likely have either the independent or the observation-panel 
format. 

The cost structure considered in Section 3 is rather 
simplistic. For instance, the second stage costs in the second 
time period may differ across individuals sampled from the 
new or from the reused clusters. Also, the costs may depend 
on the cluster size $M_i$, as it may take more time and
resources to obtain maps and collect cluster level data for 
bigger clusters. Our original motivation was to consider 
situations in which the SSU panel cost is higher than twice 
the cost of individual interviews. However, as suggested by 
one of the referees, this cost may be lower if the follow-up 
interviews are performed in a cheaper mode, such as a phone
interview or a self-administered mail survey instead of a 
personal interview. If this is the case, the observation-panel 
design is apparently the most cost-efficient of the three 
designs. 

The population structure is also an oversimplification. 
The clusters are assumed to be of balanced unchanging 
sizes. No units leave the population, and no new units 
appear. These assumptions are quite restrictive for many 
practical situations. If the population changes between two 
waves of data collection, the sample designer would want to 
include new clusters at the second wave, using the 
algorithms of Ernst (1999). The new clusters are placed into 
a separate stratum, and a clustered sample is taken from that 
stratum. In NHIS, this is implemented by the “permit” frame.
Also, the dynamic measurement effects such as condi- 
tioning and time in sample lead to rotation bias, so it might 
be beneficial to provide at least some rotation of the PSUs. 
For DHS studies, in particular, the first argument (coverage) 
is likely to be more important than the second one (time in 
sample) due to a substantial time between the waves of the 
survey (about 5 years). Arguably, both non-response and 
loss of coverage can be added to the current framework as 
sources of bias, leading to optimization of the mean squared 
total survey error rather than the design variance. Con- 
vincing models of such biases may be difficult to formulate, 
however. 

Another issue that would arise with clusters of different
sizes is that of the greater range of applicable designs. In this 
paper, we assumed SRSWOR at both stages. Other designs, 
such as sampling with probability proportional to size (PPS), 
can be used instead. For designs other than SRS, the Horvitz- 
Thompson estimator and its variance (Särndal, Swensson
and Wretman 1992, Thompson 1997) would need to be used. 
The analytical derivations become unwieldy, although prac- 
tical numerical demonstrations similar to our Sections 4 and 
5 can still be implemented. If cluster sizes change over time, 
obtaining the optimal design becomes a moving target, and 
designs optimal for the “old” measures of size will lose their 
efficiency with the “new” measures of size. 

In earlier drafts of this paper, we analyzed intermediate 
designs where a non-trivial fraction of the units are retained, 
and other units are sampled independently. The problem can 
then be viewed as variance minimization subject to inequality
constraints on the degree of the overlap $0 \le \pi^{I} \le 1$,
$0 \le \pi^{II} \le 1$. The general theory of non-linear constrained
optimization ensures that as long as the variance of the
population mean change $D$ is monotone in $\pi^{I}$ and $\pi^{II}$, the
optimum will be achieved in one of the vertices of the 
parameter space. This justifies our interest in the three 
designs considered in the paper. They correspond to the 
vertices of the parameter space: (0, 0), (1,0) and (1,1) for 
the independent, cluster-panel and observation-panel de- 
signs, respectively. The point (0,1) corresponds to an im- 
possible design with complete overlap of the individual 
units with no overlap of the clusters. Cumbersome deri- 
vations show that it is possible to satisfy the first order 
conditions in some intermediate cases, too, but they corre- 
spond to local maxima of the variance. While these results 
may also be of interest (in the sense of providing an upper 
bound on the design variances), we did not consider them in 
the paper. In the more complicated cases of the multicriterial 
optimization of Section 5, monotonicity does not necessarily 
hold, and other designs besides the three extreme cases
considered in the paper may lead to the optimal values of 
the objective function (5.2). 

Conditions of equal variances (2.9) can be relaxed at the 
price of producing substantially more complicated expres- 
sions. If the sample sizes are fixed between the two occa- 
sions, then the following changes will be necessary in all 
relevant formulas. In the expressions that do not involve 
autocorrelations, 


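$$2 S_b^2 \to S_{1b}^2 + S_{2b}^2, \qquad 2 S_w^2 \to S_{1w}^2 + S_{2w}^2, \qquad (7.1)$$

while in the expressions that do involve autocorrelations,

$$2(1-\rho^{I}) S_b^2 \to S_{1b}^2 + S_{2b}^2 - 2\rho^{I} S_{1b} S_{2b}, \qquad 2 S_w^2 (1-\rho^{II}) \to S_{1w}^2 + S_{2w}^2 - 2\rho^{II} S_{1w} S_{2w}. \qquad (7.2)$$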


Qualitatively, the results will be the same. 

The multicriterial framework of Section 5 allows for
different importance weights to be given to different variances
of interest. Relatively larger values of $\alpha_1$, $\alpha_2$ correspond
to the greater importance of the contemporaneous
means, while larger values of $\alpha_3$ correspond to the greater
importance of the change estimate. The original problem of
optimizing the design for $V[d]$ can be considered within
the context of (5.1) by setting $\alpha_1 = \alpha_2 = 0$, $\alpha_3 = 1$. This
framework can also be expanded to include designs aimed at 
measuring several variables. An additional challenge of such 
a setup is that the autocorrelations may differ across 
different variables. Some individual characteristics are 
constant over time (race, gender); others change slowly 
(housing, expenditure, political preferences), yet others may 
change faster (income or behavior). 

This paper dealt with three designs and a specific 
estimator of change: the difference in the two estimates of 
the mean in two periods of time. Other options for either 
designs or estimators are also available. For instance, in 
rotation designs, a fraction of the first wave units is retained, 
and some new units are recruited. For such designs, com- 
posite estimation (Hansen et al. 1953, Patterson 1950, Rao 
and Graham 1964, Wolter 2007) that weighs differently the 
contributions of the independent units (those retired from 
the sample after the first wave, and those newly recruited for 
the second wave) and the contributions of the panel units 
(used in both waves) would result in more efficient esti- 
mates. Generally, motivation for such designs comes from 
non-sampling considerations, such as decrease of the re- 
sponse burden and deterioration of the sample represen- 
tativeness of population due to the population change. These 
considerations can be accounted for in either the cost model 
(e.g., a greater number of callbacks required to convince a 
unit to respond), or the total survey error model (by intro- 
ducing the non-response or undercoverage bias, and con- 
sidering mean squared error rather than the design variance 
of an estimate). 


Appendix


Expectations, variances and covariances in the proofs 
below are with respect to the corresponding designs. The 
first stage of selection will be denoted with a superscript I. 
The second stage of selection will be denoted with a 
superscript II.


Proof of Proposition 2. Let us denote the sample of the
PSUs by $S^{I}$, the sample of SSUs in the first period by $S_1^{II}$
and the sample of SSUs in the second period by $S_2^{II}$. Then


Denoting the expectations with respect to the first stage as
$E_I$, and those with respect to the second stage as $E_{II}$, we
have the design variance of $d$ equal to

where the last equality assumes symmetric conditions (2.9).


Proof of Proposition 3. Let us denote the sample of the
PSUs by $S^{I}$, and the sample of SSUs by $S^{II}$. Then

Denoting the expectations with respect to the first stage as
$E_I$, and those with respect to the second stage as $E_{II}$, we
have the design variance of $d$ equal to

with the last equality holding under the symmetry
conditions.


Proof of Proposition 4. The Lagrangian function of mini- 
mizing (2.11) subject to constraint (3.1) is 

Working through the first order conditions of this
Lagrangian function leads to

Utilizing these conditions, we have

From the survey budget (3.1), the number of clusters is
found to be

Plugging these expressions into (2.11) and using the equality
relations (2.9), we obtain the variance of the estimator as


Proof of Proposition 5. The Lagrangian function of minimizing
(2.13) subject to constraint (3.4) is
Peasy Ss 

The solution with $-\sqrt{D}$ leads to a negative value of $m$,
and must be discarded.
The remaining design characteristics are

The variance of the difference estimator can be found using
(2515): 
Under symmetric conditions, $\kappa = 1$, and


= 40 gi ySeeewes. (Mle c58. 


is non-negative unless the expression in the square brackets 
is negative (which can only happen when $\rho^{I}$ is large and
$M$ is small; in that case, a corner solution $m = M$ is
realized). Furthermore,

2(1-p')S;


Proof of Proposition 6. The Lagrangian function of mini- 
mizing (2.15) subject to constraint (3.6) is 

Finally, the variance of the difference estimator is

Proof of Proposition 7. Ignoring the finite population correction
terms of the order $O(N^{-1})$ and $O(M^{-1})$, equation
(3.3) can be written as:

The statement of Proposition 7 follows immediately from
these two expressions.


On the efficiency of randomized probability proportional to size sampling


Paul Knottnerus 1


Abstract 


This paper examines the efficiency of the Horvitz-Thompson estimator from a systematic probability proportional to size (PPS) 
sample drawn from a randomly ordered list. In particular, the efficiency is compared with that of an ordinary ratio estimator. 
The theoretical results are confirmed empirically by means of a simulation study using Dutch data from the Producer Price Index.


Key Words: Horvitz-Thompson estimator; Producer Price Index; Ratio estimator; Sampling autocorrelation coefficient. 


1. Introduction 


When the study variable $y$ in a population of $N$ units is
more or less proportional to a size variable x, one may use 
the ratio estimator from a simple random sample of size n 
without replacement (SRS). An alternative estimator in such 
a situation is the Horvitz-Thompson (HT) estimator in 
combination with a systematic probability proportional to 
size sample from a randomly ordered list, henceforth called 
a randomized PPS sample. 

In recent years several authors investigated variance esti- 
mation procedures for the HT estimator from a randomized 
PPS sample. See, among others, Brewer and Donadio 
(2003), Cumberland and Royall (1981), Deville (1999), 
Knottnerus (2003), Kott (1988 and 2005), Rosén (1997) and 
Stehman and Overton (1994). For a comparison between the 
efficiencies of the ratio estimator and the randomized PPS 
estimator, the reader is referred to Foreman and Brewer 
(1971), Cochran (1977) and the references given therein. A 
drawback of these comparisons is that finite population
corrections are ignored. Hartley and Rao (1962) take the 
finite population correction into account but without an 
explicit formula for the efficiency. Elaborating on the results 
of Gabler (1984), Qualité (2008) shows that the related HT 
estimator from a rejective Poisson sample of size n is more 
efficient than the Hansen-Hurwitz estimator for a sampling 
scheme with replacement. No formula for the increased 
efficiency is given, however. 

The main aim of this paper is to derive formulas for the 
efficiency of the randomized PPS estimator relative to the 
ratio estimator. To this end, we present a simple formula for 
the change in the sample size required to maintain the same 
variance when a randomized PPS estimator is replaced by a 
ratio estimator. From the design based point of view these
formulas are valid when $n = o(N)$ as $N \to \infty$. This condition
suggests that the finite population correction can be
neglected for this kind of sampling design. Surprisingly, as 
we will see in an example in section 4, the randomized PPS 
sampling can reduce variance by more than 30% compared 


to PPS sampling with replacement even when the sampling 
fraction n/N is much smaller than 30%; see also Kott 
(2005, page 436). Furthermore, the formulas remain ap- 
propriate from a model assisted point of view when n and 
N are of the same order, provided that N is large and that 
the hypothetical model for the observations $Y_i$ ($i = 1, \ldots, N$)
satisfies mild conditions.

The outline of the paper is as follows. Section 2 describes 
an alternative expression for the variance of the HT esti- 
mator based on the sampling autocorrelation coefficient. 
The corresponding variance estimator for randomized PPS 
sampling is shown to be nonnegative with probability 1. 
Section 3 presents the formulas for the efficiency of the 
randomized PPS estimator relative to the ratio estimator for 
various data patterns often met in practice. Section 4 
features an example with data on the Producer Price Index 
in The Netherlands illustrating the substantial efficiency 
gains obtainable in practice. A counterexample shows that 
randomized PPS sampling is not always advantageous. The
paper concludes with a summary. 


2. An alternative variance expression for 
randomized PPS sampling 


Consider a population $U = \{1, \ldots, N\}$, and let $s$ be a
sample of fixed size $n$ drawn from $U$ without replacement
according to a given sampling design with first order inclusion
probabilities $\pi_i$ and second order inclusion probabilities
$\pi_{ij}$ ($i, j = 1, \ldots, N$). The HT estimator of the population
total $Y = \sum_{i \in U} Y_i$ is defined by $\hat Y_{HT} = \sum_{i \in s} Y_i / \pi_i$.
Suppose there is a measure of relative size $X_i$ (i.e., $X =
\sum_{i \in U} X_i = 1$) such that all $X_i < 1/n$. In fact, it is assumed
here that units with $X_i \ge 1/n$ are put together in a separate
certainty-stratum. When the $\pi_i$ are proportional to these
size measures, $\pi_i = nX_i$. Defining $Z_i = Y_i / X_i$, we can
write $Y$ as a weighted mean of the $Z_i$, that is, $Y = \mu_z =
\sum_{i \in U} X_i Z_i$. Likewise, we can write the HT estimator of $Y$ in
randomized PPS sampling as $\hat Y_{PPS} = \sum_{i \in s} Z_i / n = \bar z_s$, where $\bar z_s$ is the
sample mean of the $Z_i$.
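
The identity between the HT estimator and the sample mean of the $Z_i$ can be illustrated with a few lines of Python. The sample below is hypothetical and the function names are chosen here only for illustration; the one substantive assumption is that $\pi_i = nX_i$, as stated above.

```python
def ht_estimate(y_sample, pi_sample):
    """Horvitz-Thompson estimate of a total: sum of y_i / pi_i over the sample."""
    return sum(y / p for y, p in zip(y_sample, pi_sample))

def pps_estimate(y_sample, x_sample):
    """Sample mean of Z_i = Y_i / X_i, where X_i are relative sizes."""
    z = [y / x for y, x in zip(y_sample, x_sample)]
    return sum(z) / len(z)

# Hypothetical sample of n = 3 units with relative sizes X_i < 1/n
x_s = [0.10, 0.05, 0.20]
y_s = [12.0, 5.5, 21.0]
n = len(x_s)
pi_s = [n * x for x in x_s]  # inclusion probabilities proportional to size

ht = ht_estimate(y_s, pi_s)
pps = pps_estimate(y_s, x_s)  # agrees with ht up to rounding error
```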

The variance of the randomized PPS estimator $\hat Y_{PPS}$ is

$$\text{var}(\hat Y_{PPS}) = \frac{1}{n^2} \sum_{i \in U} \sum_{j \in U} (\pi_{ij} - \pi_i \pi_j) Z_i Z_j \qquad (1)$$

with $\pi_{ii} = \pi_i$. The former is attributed to Horvitz and
Thompson (1952) and the latter is due to Sen (1953) and 
Yates and Grundy (1953). The following alternative expres- 
sion for the variance is more convenient for our purposes: 


$$\text{var}(\hat Y_{PPS}) = \text{var}(\bar z_s) = \{1 + (n-1)\rho_z\} \frac{\sigma_z^2}{n}, \qquad (3)$$

where $\sigma_z^2 = \sum_{i \in U} X_i (Z_i - \mu_z)^2$, and

For a proof of (3), see Knottnerus (2003, page 103). Note
that $\sigma_z^2/n$ would have been the variance if the sample had
been drawn with replacement with drawing probabilities
$X_i$.

The sampling autocorrelation coefficient $\rho_z$ in (4) is a
generalization of the more familiar intraclass correlation
coefficient $\rho$ in systematic sampling with equal probabilities;
see, for instance, Cochran (1977, pages 209 and 240)
and Särndal, Swensson and Wretman (1992, page 79). Note
that $\rho_z$ is a fixed population parameter. The phrase sampling
autocorrelation is used because $\rho_z$ refers to the
autocorrelation between two randomly chosen observations,
say $z_{s1}$ and $z_{s2}$, from $s$. Consequently, the value of $\rho_z$
depends on the sampling design. In particular, when sampling
with replacement, $\rho_z = 0$, while under SRS sampling
$\rho_z = -1/(N-1)$.
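
As a small consistency check, the variance expression $\{1 + (n-1)\rho_z\}\sigma_z^2/n$ can be evaluated for simple random sampling without replacement, where all $X_i = 1/N$ and the autocorrelation takes its standard value $-1/(N-1)$. The Python sketch below compares the result with the textbook SRS variance of the expanded total. The population values and function names are hypothetical.

```python
def var_via_rho(z, x, n, rho):
    """Variance {1 + (n - 1) * rho} * sigma_z^2 / n with X-weighted sigma_z^2."""
    mu = sum(xi * zi for xi, zi in zip(x, z))
    sigma2 = sum(xi * (zi - mu) ** 2 for xi, zi in zip(x, z))
    return (1 + (n - 1) * rho) * sigma2 / n

def var_srs_total(y, n):
    """Textbook SRS variance of the expanded total: N^2 (1 - n/N) S^2 / n."""
    N = len(y)
    mean = sum(y) / N
    s2 = sum((yi - mean) ** 2 for yi in y) / (N - 1)
    return N ** 2 * (1 - n / N) * s2 / n

y = [3.0, 7.0, 4.0, 10.0, 6.0, 2.0]   # hypothetical population values
N, n = len(y), 3
x = [1.0 / N] * N                     # equal relative sizes, i.e., SRS
z = [yi / xi for yi, xi in zip(y, x)]

v_rho = var_via_rho(z, x, n, rho=-1.0 / (N - 1))
v_srs = var_srs_total(y, n)           # the two variances coincide
```

Setting `rho=0.0` instead returns $\sigma_z^2/n$, the with-replacement variance, so the factor $1 + (n-1)\rho_z$ plays the role of the finite population correction in this special case.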

Although exact expressions for the $\pi_{ij}$ under randomized
PPS sampling are available, they can be cumbersome when 
N is large. For an exact expression, see Connor (1966) and 
for a modification Hidiroglou and Gray (1980). Here we use 
an approximation proposed by Knottnerus (2003, page 197): 


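$$\tilde\pi_{ij} = n(n-1) \frac{X_i X_j (1 - X_i - X_j)}{\gamma (1 - 2X_i)(1 - 2X_j)}, \qquad (5)$$

$$\gamma = \frac{1}{2} + \frac{1}{2} \sum_{i \in U} \frac{X_i}{1 - 2X_i}.$$


These $\tilde\pi_{ij}$ have been shown to satisfy the second-order
restrictions for the $\pi_{ij}$:

$$\sum_{i \in U} \sum_{j \ne i} \tilde\pi_{ij} = n(n-1),$$

and

$$\sum_{j \ne i} \tilde\pi_{ij} = (n-1)\pi_i.$$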


Furthermore, (5) is correct for SRS sampling for any
$n < N$, while the $\tilde\pi_{ij}$ coincide with the $\pi_{ij}$ from the special
designs proposed by Brewer (1963a) and Durbin (1967) for
PPS samples with $n = 2$. Moreover, the $\tilde\pi_{ij}$ in (5) can be
written in factorized form as proposed by Brewer and
Donadio (2003). That is,

$$\tilde\pi_{ij} = \pi_i \pi_j (c_i + c_j)/2, \qquad (6)$$

and

$$c_i = (n-1)/\{n\gamma(1 - 2X_i)\}.$$

An implication of approximation (5) is that $\tilde\pi_{ij}/n(n-1)$
does not depend on $n$. Hence, the corresponding approximation
of $\rho_z$ does not depend on $n$ (recall we have
assumed that every $X_i < 1/n$).
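
The properties claimed for approximation (5) can be checked numerically. The Python sketch below reads (5) as $n(n-1) X_i X_j (1 - X_i - X_j) / \{\gamma (1 - 2X_i)(1 - 2X_j)\}$ with $\gamma = 1/2 + (1/2)\sum_k X_k/(1 - 2X_k)$, which is a reconstruction from the surrounding text rather than a verbatim formula. The size measures are hypothetical.

```python
def approx_pi_ij(x, n):
    """Approximate second-order inclusion probabilities in the form read from (5)."""
    N = len(x)
    gamma = 0.5 + 0.5 * sum(xi / (1 - 2 * xi) for xi in x)
    pi = [[0.0] * N for _ in range(N)]
    for i in range(N):
        for j in range(N):
            if i != j:
                num = x[i] * x[j] * (1 - x[i] - x[j])
                den = gamma * (1 - 2 * x[i]) * (1 - 2 * x[j])
                pi[i][j] = n * (n - 1) * num / den
    return pi

x = [0.05, 0.10, 0.15, 0.20, 0.22, 0.28]   # hypothetical relative sizes, sum to 1
n = 3                                      # every x_i < 1/n
pi2 = approx_pi_ij(x, n)

total = sum(sum(row) for row in pi2)       # should equal n(n - 1)
row_sums = [sum(row) for row in pi2]       # should equal (n - 1) * n * x_i

# SRS special case: N = 5 equal sizes, n = 2, exact value n(n-1)/{N(N-1)}
srs_value = approx_pi_ij([0.2] * 5, 2)[0][1]
```

Both second-order restrictions hold up to rounding error for these inputs, and the SRS case returns $n(n-1)/\{N(N-1)\} = 0.1$.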

This nondependence on n would also result had we used 
the approximation proposed by Hartley and Rao (1962) for 
randomized PPS sampling: 


$$\pi_{ij}^{HR} = n(n-1) X_i X_j \{1 + X_i + X_j - \mu_x + 2(X_i^2 + X_j^2 + X_i X_j) - 3\mu_x (X_i + X_j - \mu_x) - 2\mu_{x^2}\}, \qquad (7)$$

where $\mu_{x^2} = \sum_{i \in U} X_i^3$ (recall $\mu_x = \sum_{i \in U} X_i^2$). Obviously,
$\pi_{ij}^{HR}/n(n-1)$ does not depend on $n$. At the time Hartley
and Rao assumed that $n = O(1)$ as $N \to \infty$. In addition,
referring to a private conversation with J.N.K. Rao,
Thompson and Wu (2008) state that approximation (7) is
valid when $n = o(N)$ as $N \to \infty$. For an example that (5)
and (7) cannot be used for any $n$ and $N$, see Appendix A.

Since both (5) and (7) lead to approximations for $\rho_z$ in
randomized PPS sampling that are $\rho_z\{1 + o(1)\}$ as $N \to \infty$
with $n = o(N)$, (5) can be used for calculating $\rho_z$ in practice
when $n \ll N$ and $N$ is large. For ease of exposition,
it is assumed here that there is a positive constant $c$
such that $\rho_z < -c/N$. See also Kott (2005, page 436) who
discusses estimating the variance under PPS sampling when 
n = O(N”). 

Suppose $\gamma = 1 + \mu_2 + O(1/N^2)$ and $\mu_2 = O(1/N)$
(which follow from the conditions of Theorem 1 below). It
is not hard to see that, after dropping $O(1/nN)$ terms, $c_i$ in
(6) is identical with $c_{i,BD} = (n-1)/\{n(1 + \mu_2 - 2X_i)\}$. The
latter expression is equation (11) of Brewer and Donadio,
which is based on $\pi_{ij,HR}$ in (7).

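The approach proposed here is somewhat different from
Knottnerus (2003). First, rewrite (5) as

$$\pi_{ij,K} = n(n-1)\,\frac{X_i X_j}{2\gamma}\left\{\frac{1}{1-2X_i} + \frac{1}{1-2X_j}\right\}. \qquad (8)$$

Substituting (8) into (4), we obtain a new, simple approximation
for $\rho_z$:

$$\begin{aligned}
\rho_z &= \frac{1}{2\gamma}\sum_{i \in U}\sum_{\substack{j \in U \\ j \ne i}} X_i X_j \left\{\frac{1}{1-2X_i} + \frac{1}{1-2X_j}\right\}\frac{(Z_i - Y)(Z_j - Y)}{\sigma_z^2} \\
&= \frac{1}{\gamma}\sum_{i \in U}\sum_{\substack{j \in U \\ j \ne i}} \frac{X_i X_j}{1-2X_i}\,\frac{Z_i - Y}{\sigma_z}\,\frac{Z_j - Y}{\sigma_z} \\
&= -\frac{1}{\gamma}\sum_{i \in U}\frac{X_i^2 (Z_i - Y)^2}{(1-2X_i)\,\sigma_z^2}. \qquad (9)
\end{aligned}$$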


In the second line, we used the equality
$\sum_{i}\sum_{j \ne i} a_{ij}u_i = \sum_{i}\sum_{j \ne i} a_{ij}u_j$ when $a_{ij} = a_{ji}$.
In the last line, we used $\sum_{j \in U} X_j (Z_j - Y) = 0$.

Next, let $\bar X$ be the population mean of $X_1, \ldots, X_N$
and define $\sigma_x^2$ and $V_x^2$ by

$$\sigma_x^2 = \sum_{i \in U} X_i (X_i - \mu_2)^2$$

and

$$V_x^2 = \sum_{i \in U} (X_i - \bar X)^2 / N,$$


respectively. In the following theorem (9) is further 
simplified. 


Theorem 1. Suppose that $(Z_i - Y)/\sigma_z = O(1)$ as $N \to \infty$
and that there are positive constants $c$ and $C$ such that
aX —c.celti.<e and 0<X,.<C<1/2,. Then, for 
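large $N$ and $n \ll N$,

$$\rho_z = -\frac{\sum_{i \in U} X_i^2 (Z_i - Y)^2}{\sum_{i \in U} X_i (Z_i - Y)^2}\left\{1 + O\!\left(\frac{1}{N}\right)\right\} = O\!\left(\frac{1}{N}\right). \qquad (10)$$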


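Proof. Because $\bar X = 1/N$, it follows from the above
assumptions that the weighted mean $\mu_2$ $[= \sum X_i^2 = N(V_x^2 + \bar X^2)]$
is of order $1/N$ and hence, $\sigma_x = O(1/N)$. Because
$(1 - 2X_i)^{-1} = 1 + 2X_i + O(X_i^2)$ for $0 < X_i < C < 1/2$, $\rho_z$
from (9) can be written for $N \to \infty$ as

$$\rho_z = -\frac{1}{\gamma}\left\{\sum_{i \in U} \frac{X_i^2 (Z_i - Y)^2}{\sigma_z^2} + O\Big(\sum_{i \in U} X_i^3\Big)\right\},$$

where $\sum_{i \in U} X_i^3 = \sigma_x^2 + \mu_2^2 = O(N^{-2})$, and

$$\gamma = \frac{1}{2} + \frac{1}{2}\sum_{i \in U} X_i\{1 + 2X_i + O(X_i^2)\} = 1 + \mu_2 + O\!\left(\frac{1}{N^2}\right) = 1 + O\!\left(\frac{1}{N}\right),$$

from which (10) follows. This concludes the proof.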


Substituting (10) into (3), we get

$$\mathrm{var}(\hat Y_{PPS}) = \frac{1}{n}\sum_{i \in U} X_i\{1 - (n-1)X_i\}(Z_i - Y)^2, \qquad (11)$$

which is also given by Hartley and Rao (1962). It is noteworthy
that approximation (10) also follows directly from
substituting the simple approximation $\pi_{ij} = n(n-1)X_iX_j$
into (4). Likewise, use of the $\pi_{ij,HR}$ from (7) leads to an expression
almost similar to (9) and hence to (10). In addition,
direct use of T4p im (1) or (2) for the SRS case with 
X,=X,=1/N may lead to errors of more than 100% for 
populations with Y = Bes see Knottnerus (2003, pages 
274-6). Hence, (1) and (2) are more sensitive to small errors
in the $\pi_{ij}$ than (3) and (4). Furthermore, note that when $n$ is
so small that $|n\rho_z| \ll 1$, we may set $\rho_z = 0$, yielding the
with-replacement variance formula of Hansen and Hurwitz
(1943).
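In order to estimate (3) using an estimate $\hat\rho_z$, denote, as before, a
randomly chosen observation from $s$ by $z_{s1}$. Then we have

$$\sigma_z^2 = \mathrm{var}(z_{s1}) = \mathrm{var}\{E(z_{s1} \mid s)\} + E\{\mathrm{var}(z_{s1} \mid s)\} = \mathrm{var}(\bar z_s) + \frac{n-1}{n}E(s_z^2),$$

where

$$s_z^2 = \frac{1}{n-1}\sum_{i \in s}(z_i - \bar z_s)^2.$$

Now from (3), it is seen that $s_z^2/(1 - \rho_z)$ is an unbiased
estimator for $\sigma_z^2$. When $\rho_z$ is very small, the term $(1 - \rho_z)$
can be neglected. When $n$ is sufficiently large, the ratio $\rho_z$
from (9) can be estimated by

$$\hat\rho_{z9} = -\frac{1}{\hat\gamma}\,\frac{\sum_{i \in s} X_i (z_i - \bar z_s)^2/(1 - 2X_i)}{\sum_{i \in s}(z_i - \bar z_s)^2},$$

where

$$\hat\gamma = \frac{1}{2} + \frac{1}{2n}\sum_{i \in s}\frac{1}{1 - 2X_i}.$$

Because $\hat\gamma \ge 1$ and $X_i < 1/n$, we have $\hat\rho_{z9} > -1/(n-2)$.
For the bias of an estimated ratio when $n$ is small, see
Cochran (1977, page 160).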

In a similar manner $\rho_z$ from (10) can be estimated by

$$\hat\rho_{z10} = -\frac{\sum_{i \in s} X_i (z_i - \bar z_s)^2}{\sum_{i \in s} (z_i - \bar z_s)^2}.$$

Hence, replacing $\sigma_z^2$ and $\rho_z$ in (3) by $s_z^2/(1 - \hat\rho_{z10})$ and
$\hat\rho_{z10}$, respectively, leads to a nonnegative variance estimator
with probability 1. This also holds for $\hat\rho_{z9}$ when all $X_i <
1/(n+1)$. The estimator for $\mathrm{var}(\hat Y_{PPS})$ thus obtained becomes

$$\widehat{\mathrm{var}}(\hat Y_{PPS}) = \frac{\{1 + (n-1)\hat\rho_{z10}\}\,s_z^2}{n(1 - \hat\rho_{z10})}.$$


Moreover, for moderate values of $N$, estimator $\hat\rho_{z9}$ has
probably better properties than $\hat\rho_{z10}$ because the $\pi_{ij,K}$ underlying
(9) satisfy exactly the second-order restrictions irrespective
of the values of $n$ and $N$.
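
As an illustration of how such a variance estimate could be computed from a sample, the sketch below assumes sample-based estimators of the form $\hat\rho = -\sum_s X_i (z_i - \bar z_s)^2 / \sum_s (z_i - \bar z_s)^2$ for (10), and the same ratio with the extra factor $1/\{\hat\gamma(1-2X_i)\}$ for (9). These forms, the function name, and the sample values are assumptions made for the example, not a transcription of the article's formulas.

```python
def var_pps_estimate(x_s, y_s, use_gamma=False):
    """Plug-in estimate of var(Y_PPS) = {1 + (n-1) rho} sigma_z^2 / n.

    x_s: normalized sizes X_i of the sampled units, y_s: their y-values.
    use_gamma=True uses the rho estimate patterned on (9), else on (10).
    """
    n = len(x_s)
    z = [y / x for x, y in zip(x_s, y_s)]
    z_bar = sum(z) / n                      # equals the HT estimate of Y
    dev2 = [(zi - z_bar) ** 2 for zi in z]
    s2 = sum(dev2) / (n - 1)
    if use_gamma:
        gamma_hat = 0.5 + sum(1 / (1 - 2 * x) for x in x_s) / (2 * n)
        rho = -sum(x * d / (1 - 2 * x) for x, d in zip(x_s, dev2)) \
            / (gamma_hat * sum(dev2))
    else:
        rho = -sum(x * d for x, d in zip(x_s, dev2)) / sum(dev2)
    # s2 / (1 - rho) estimates sigma_z^2, so the estimate of (3) becomes:
    return (1 + (n - 1) * rho) * s2 / (n * (1 - rho)), rho


# Hypothetical sample of n = 4 units (sizes are population shares X_i < 1/n).
x_sample = [0.02, 0.05, 0.01, 0.03]
y_sample = [2.1, 4.8, 1.2, 3.3]
v10, rho10 = var_pps_estimate(x_sample, y_sample)
v9, rho9 = var_pps_estimate(x_sample, y_sample, use_gamma=True)
print(v10, rho10, v9, rho9)
```

With all $X_i < 1/(n+1)$, as in this example, both estimates of the variance are positive.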


3. Efficiency of $\hat Y_{PPS}$ for large n and N


3.1 Efficiency formulas
Because $\sum_{i \in U} X_i = 1$, the ratio estimator for $Y$ becomes

$$\hat Y_R = \frac{\sum_{i \in s} Y_i}{\sum_{i \in s} X_i}.$$

For sufficiently large $n$ the commonly used approximation
for its variance is

$$\mathrm{var}(\hat Y_R) = \frac{N(N-n)}{n(N-1)}\sum_{i \in U} X_i^2 (Z_i - Y)^2. \qquad (12)$$
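From (3) and (12) it can be seen that the efficiency of $\hat Y_{PPS}$
relative to $\hat Y_R$ can be written as

$$\mathrm{Eff}_{PPS} = \frac{\mathrm{var}(\hat Y_R)}{\mathrm{var}(\hat Y_{PPS})} = \frac{(N-n)\sum_{i \in U} X_i^2 (Z_i - Y)^2}{\{1 + (n-1)\rho_z\}\,\sigma_z^2}, \qquad (13)$$

assuming $N/(N-1) \approx 1$. Combining (10) and (13) gives

$$\mathrm{Eff}_{PPS} = \frac{-(N-n)\rho_z}{1 + (n-1)\rho_z}. \qquad (14)$$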

Now suppose that the observations $Y_i$ satisfy the model:

$$Y_i = \mu X_i + \varepsilon_i, \qquad (15)$$

with $E(\varepsilon_i) = 0$, $E(\varepsilon_i^2) = \sigma^2 X_i^{\delta}$, and $E(\varepsilon_i \varepsilon_j) = 0$ $(i \ne j)$.
Consequently, for the $Z_i$ we have $Z_i = \mu + u_i$ with
$E(u_i) = 0$, $E(u_i^2) = \sigma^2 X_i^{\delta - 2}$, and $E(u_i u_j) = 0$ $(i \ne j)$.
According to Kott (1988), $\delta$ often lies between 1 and 2. See
also Brewer (1963b). Brewer and Donadio (2003) showed
that by assuming a model like (15), (7) and hence (10) and
(14) hold when $n$ and $N$ are of the same order as
$N \to \infty$. Furthermore, for sufficiently large $N$ we can
replace $Y$ as well as the numerator and denominator in (10)
by their model expectations. This yields

$$\rho_z = -\frac{\sum_{i \in U} X_i^{\delta}}{\sum_{i \in U} X_i^{\delta - 1}}. \qquad (16)$$

In the next subsections we look more closely at the
relationship between $\delta$ and the efficiency of $\hat Y_{PPS}$.


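3.2 Efficiency of $\hat Y_{PPS}$ when $\delta = 2$


For $\delta = 2$, (16) gives $\rho_z = -\sum_{i \in U} X_i^2 = -\mu_2$, which
can also be written as

$$\rho_z = -\frac{1}{N}(1 + CV_x^2), \qquad (17)$$

because

$$\sum_{i \in U} X_i^2 = N(\bar X^2 + V_x^2) = N\bar X^2 (1 + CV_x^2),$$

where $\bar X = 1/N$ and $CV_x = V_x/\bar X$ is the coefficient of
variation of the $X_i$. Substituting (17) into (14) gives

$$\mathrm{Eff}_{PPS} = \frac{(N-n)(1 + CV_x^2)}{N - (n-1)(1 + CV_x^2)}.$$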

Hence, for $\delta = 2$, the efficiency of the randomized PPS
sample is high when the variability among the $X_i$ is high.
When $CV_x = 0$, randomized PPS sampling amounts to
SRS sampling and obviously, $\mathrm{Eff}_{PPS} = 1$ assuming
$(N - n + 1) \approx (N - n)$; note that this assumption holds
when $N$ is sufficiently large and $n/N < f_0 < 1$.

Observe that substituting $n = n_{PPS}(1 + CV_x^2)$ into (12)
leads to about the same outcome as (3) and (10) with $n_{PPS}$
instead of $n$. Hence, when $CV_x = 1.5$, randomized PPS
sampling with sample size $n_{PPS} = 100$ is as efficient as the
ratio estimator from an SRS sample of size $n_{SRS} = 325$.
More generally, assuming that $(n-1)/n \approx 1$, it is seen
from (3), (10), and (12) that a ratio estimator from an SRS
sample of size $n_{SRS}$ is as efficient as a PPS sample of size
$n_{PPS}$ when

$$n_{SRS} = -n_{PPS}\,\rho_z N. \qquad (18)$$
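
The efficiency formula (14), the special case (17) and the equivalent sample size (18) are simple enough to evaluate directly. The following sketch reproduces the numerical example in the text ($CV_x = 1.5$, $n_{PPS} = 100$); the function names and the population size $N = 10{,}000$ are illustrative assumptions.

```python
def eff_pps(rho, n, N):
    """Efficiency (14) of randomized PPS relative to the SRS ratio estimator."""
    return -(N - n) * rho / (1 + (n - 1) * rho)


def rho_delta2(cv_x, N):
    """rho_z under model (15) with delta = 2, equation (17)."""
    return -(1 + cv_x ** 2) / N


def n_srs_equivalent(n_pps, rho, N):
    """SRS sample size giving about the same variance, equation (18)."""
    return -n_pps * rho * N


N = 10_000                       # hypothetical population size
rho = rho_delta2(1.5, N)
n_srs = n_srs_equivalent(100, rho, N)          # about 325, as in the text
eff_no_variation = eff_pps(rho_delta2(0.0, N), 100, N)   # (N-n)/(N-n+1)
print(n_srs, eff_pps(rho, 100, N), eff_no_variation)
```

With $CV_x = 0$ the efficiency equals $(N-n)/(N-n+1)$, which is why the text treats it as 1 for large $N$.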


3.3 Efficiency of $\hat Y_{PPS}$ for $\delta < 1$ vs $\delta \ge 1$


Another special case is $\delta = 1$. From (16), $\rho_z = -1/N$
when $\delta = 1$. Subsequently, it follows from (14) that under
model (15) $\mathrm{Eff}_{PPS} = 1 + O(N^{-1})$, provided that $n/N <
f_0 < 1$ as $N \to \infty$, irrespective of the value of $CV_x$.
Furthermore, it can be shown that $\mathrm{Eff}_{PPS}$ is an increasing
function of $\delta$. This is proven below in Lemma 1. Hence,
for $\delta < 1$ the randomized PPS estimator is less efficient
than the ratio estimator, while for $\delta > 1$ the randomized
PPS estimator is more efficient than the ratio estimator.


Lemma 1. Let $\mathrm{Eff}_{PPS}$ and $\rho_z$ be defined by (14) and (16),
respectively. If $V_x^2 > 0$, then $\mathrm{Eff}_{PPS}$ is a monotonically
increasing function of $\delta$.
Proof. Write $\rho_z$ from (16) as a weighted mean of the
(negative) $X_i$,

$$\rho_z = -\mu(\delta) = -\sum_{i \in U} w_i X_i,$$

where

$$w_i = \frac{X_i^{\delta - 1}}{\sum_{j \in U} X_j^{\delta - 1}} \quad [\text{note that } \mu_2 = \mu(2)].$$

Let $X_i > X_j$ $(i \ne j)$, and define $h(\delta)$ as $w_i/w_j =
(X_i/X_j)^{\delta - 1}$. Since $h(\delta)$ is increasing in $\delta$, the weight of
the larger $X_i$ is increasing compared to that of $X_j$ when $\delta$
is increasing. Hence, $\mu(\delta)$ is increasing and $\rho_z$ is decreasing
in $\delta$. It suffices therefore to show that $\mathrm{Eff}_{PPS}$ is
decreasing in $\rho_z$. Writing (14) as

$$\mathrm{Eff}_{PPS} = \frac{N-n}{n-1}\left\{\frac{1}{1 + (n-1)\rho_z} - 1\right\},$$

it is seen that $\mathrm{Eff}_{PPS}$ is decreasing in $\rho_z$ indeed. This
concludes the proof.
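
Lemma 1 can be illustrated numerically. The sketch below evaluates (16) and (14) on a grid of $\delta$ values, using the five sizes of Table 2 with $n = 2$ as an arbitrary example; it is a check on one data set, not a proof.

```python
def rho_model(X, delta):
    """rho_z under model (15), equation (16)."""
    return -sum(x ** delta for x in X) / sum(x ** (delta - 1) for x in X)


def eff_pps(rho, n, N):
    """Efficiency formula (14)."""
    return -(N - n) * rho / (1 + (n - 1) * rho)


X = [0.0455, 0.1364, 0.1818, 0.2727, 0.3636]   # sizes of Table 2
n, N = 2, len(X)
deltas = [0.0, 0.5, 1.0, 1.5, 2.0]
rhos = [rho_model(X, d) for d in deltas]
effs = [eff_pps(r, n, N) for r in rhos]

for d, r, e in zip(deltas, rhos, effs):
    print(f"delta={d:.1f}  rho={r:.4f}  Eff={e:.3f}")
```

For these sizes $\rho_z$ falls and the efficiency rises as $\delta$ grows, and $\delta = 1$ gives $\rho_z = -1/N$. Because $N = 5$ is tiny, the efficiency at $\delta = 1$ is $(N-n)/(N-n+1) = 0.75$ rather than close to 1.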


3.4 An alternative structure among the disturbances 


Finally, suppose the variance of the disturbances in (15)
is of the form:

$$\mathrm{var}(\varepsilon_i) = c_1 X_i + c_2 X_i^2 \quad (c_1 > 0;\ c_2 \ge 0).$$

See Kott (1988). For this case we obtain in analogy
with (16)

$$\rho_z = -\sum_{i \in U} w_i X_i,$$

where

$$w_i = \frac{1 + \varphi X_i}{N + \varphi}, \qquad \varphi = c_2/c_1.$$

When $\varphi = 0$, $\rho_z = -1/N$. Hence, when $c_2 = 0$, PPS
sampling is only as efficient as the ordinary ratio estimator
from SRS sampling. Along the same lines as the proof of
Lemma 1, it can be shown that $\rho_z$ is decreasing in $\varphi$ while
$\mathrm{Eff}_{PPS}$ is increasing in $\varphi$. Hence, for this case the randomized
PPS estimator is always more efficient than the
ratio estimator when $c_2$ is positive.


4. An application to the Producer Price Index 


The Producer Price Index (PPI) in The Netherlands is 
based on about 2,500 commodity price indexes organized 
by type of product. The price index for a specific commod- 
ity can be written as 
$$I = \sum_{i \in U} X_i Z_i,$$

where $Z_i$ is the price change for that commodity of
establishment $i$ relative to the basic period while $X_i$ is the
relative sales of that commodity by establishment $i$ in the
basic period (recall $\sum_{i \in U} X_i = 1$).

In the example given here, we examine the price changes
of 70 establishments for the commodity Basic Metal in
December of 2005 relative to December of 2004; see Table
1. We compare the variance of the ratio estimator from an
SRS sample with the variance of the HT estimator from a
randomized PPS sample when $n = 9$. Applying (12) to
these data gives $\mathrm{var}(\hat Y_R) = 101$. If the sample had been
drawn with replacement the variance would have been 116.
Applying (3) and (9) for a randomized PPS sample gives
$\mathrm{var}(\hat Y_{PPS}) = 29.9$. This outcome takes $\gamma$ into account and
lies close to the simulated variance of 29.2 from a simulation
experiment consisting of 80,000 randomized PPS samples
of size $n = 9$ from the set of 70 establishments. Hence,
$\mathrm{Eff}_{PPS} = 3.5$. Because formula (12) for $\mathrm{var}(\hat Y_R)$ is only
asymptotically unbiased, we also carried out simulations
evaluating the mean square error (MSE) and the bias of $\hat Y_R$,
resulting in a simulated MSE of 108 and a relatively small bias of
0.7. This confirms the conjecture that (12) gives an
underestimation of the true variance; see Cochran (1977).
Hence, for moderate samples the true value of $\mathrm{Eff}_{PPS}$ might
be somewhat higher than (14) suggests.

Furthermore, it is noteworthy that the simpler formula
(10) for $\rho_z$ in combination with (3) gives almost the same
result, $\mathrm{var}(\hat Y_{PPS}) = 30.7$, even though $N = 70$ is not very
large. The with-replacement PPS variance would have been
43.8. Hence, the variance reduction for randomized PPS
sampling is more than 30% even though the sampling
fraction $n/N$ is much smaller. According to (18), formula
(12) with $n_{SRS} = 26$ gives about the same outcome as (3)
with $n_{PPS} = 9$; note: $\rho_z = -0.042$. Hence, the sample sizes
differ by a factor 2.9, which is more or less in line with the
factor $(1 + CV_x^2) = 3.1$ from subsection 3.2. This should
not be surprising because the price changes and their
variability hardly depend on the sizes of the company.
Fitting a double log regression

$$\ln (Z_i - Y)^2 = \alpha + \beta \ln X_i + v_i \qquad (19)$$

results in the estimate $\hat\beta = 0.07$ for the data in Table 1; units
with $Z_i = Y$ should be omitted in the regression. The
estimate $\hat\beta = 0.07$ corresponds with $\hat\delta = 2.07$ for the
disturbances in (15), which explains the superiority of
randomized PPS sampling for this type of data. Also for
other commodities $\delta$ often was about 2; see Enthoven
(2007).
Table 1
Price changes (Z_i) and sizes (X_i) of 70 establishments

  i   price change    size       i   price change    size
  1     -18.4%       0.0608     36      34.8%       0.0427
  2     -16.0%       0.0784     37      13.1%       0.0121
  3       3.3%       0.0762     38      Seo         0.0351
  4      12.5%       0.0100     39     -24.8%       0.0074
  5       0.0%       0.0029     40      55.3%       0.0009
  6       8.3%       0.0006     41      40.5%       0.0066
  7     -39.0%       0.0182     42      34.6%       0.0022
  8     -25.1%       0.0020     43      evo         0.0001
  9      Ns          0.0040     44       0.0%       0.0039
 10       4.4%       0.0066     45       3.9%       0.0304
 11      -4.9%       0.0039     46      25.4%       0.0209
 12      -8.9%       0.0070     47      25.6%       0.0062
 13      -7.0%       0.0148     48       0.0%       0.0033
 14     -15.0%       0.0108     49      -0.3%       0.0019
 15     -10.7%       0.0087     50      66.6%       0.0346
 16      -9.0%       0.1079     51       0.0%       0.0039
 17     -11.3%       0.0247     52      -2.9%       0.0007
 18      10.6%       0.0024     53      15.8%       0.0011
 19     -23.2%       0.0001     54       0.0%       0.0026
 20     -25.4%       0.0001     55       0.0%       0.0018
 21     -80.7%       0.0002     56      11.6%       0.0057
 22      13.4%       0.0005     57       0.0%       0.0042
 23     -42.5%       0.0010     58       0.0%       0.0236
 24     -34.8%       0.0014     59      -1.5%       0.0015
 25     -30.0%       0.0126     60       0.0%       0.0003
 26       8.0%       0.0530     61      Me          0.0067
 27       0.0%       0.0208     62       0.0%       0.0012
 28       2.1%       0.0119     63       0.8%       0.0040
 29      11.3%       0.0208     64       2.0%       0.0009
 30       0.7%       0.0322     65       2.3%       0.0018
 31       9.5%       0.0447     66       4.7%       0.0026
 32      11.5%       0.0018     67       0.9%       0.0064
 33       5.8%       0.0174     68      -1.0%       0.0309
 34      -6.9%       0.0197     69      -0.5%       0.0005
 35       0.0%       0.0124     70       0.0%       0.0006


We conclude this section with a small example showing
that randomized PPS is not always better than the ratio
estimator. Although the data in Table 2 for a population of
five units are artificial, a data pattern like this may occur in
financial branches where very small financial companies
may grow very fast with respect to certain financial variables.
This high variability among growth rates of small
companies results in a low value for $\delta$. For an SRS sample
with $n = 2$ from the five units in Table 2 the variance of
the ratio estimator is 211 according to (12); simulations give
an MSE of 323. This is much less than the variance of 557
found in a simulation consisting of 80,000 randomized PPS
samples of size $n = 2$. Formula (3) in combination with (9)
gives the same outcome: 557. This would also be the correct
variance had the sample been drawn according to Brewer
(1963a) or Durbin (1967). Formula (11), based on (10),
gives a slightly different value, 556.

Regression (19) with the data from Table 2 yields
$\hat\beta = -3.0$, and hence $\hat\delta = -1.0$. In line with the findings
of subsection 3.3 this low value $\hat\delta = -1.0$ explains why
$\hat Y_{PPS}$ is less efficient than $\hat Y_R$ in this example. Moreover, the
ordinary direct estimator $N\bar y_s$ from an SRS sample has a
variance of 356, which is even smaller here than the
variance in randomized PPS sampling; $\bar y_s$ being the sample
mean of the $Y_i$. Hence, for this type of data, the ratio
estimator is the best option. Recall that the ratio estimator
has a smaller variance than $N\bar y_s$ when $b > Y/2X$, where
$b$ is the slope of a regression from $Y_i$ on $X_i$ and a constant
$(i = 1, \ldots, N)$; see Knottnerus (2003, page 117). So the data
$Y_i$ $(= X_i Z_i)$ in Table 2 certainly do not exhibit a flat trend.
Table 2
Growth rates of assets (Z_i) and sizes (X_i) of 5 establishments

  i   growth rate    size
  1      200%       0.0455
  2       33%       0.1364
  3       75%       0.1818
  4       33%       0.2727
  5       62%       0.3636

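
The figures quoted for the five-unit example can be approximately recomputed from Table 2. The sketch below applies (12), (3) with (9), (11), the usual SRS variance of the direct estimator $N\bar y_s$, and an ordinary least squares fit of regression (19). It assumes (3) has the form $\{1 + (n-1)\rho_z\}\sigma_z^2/n$, and because it works from the rounded table entries the results are close to, but not identical with, the values reported in the text.

```python
import math

X = [0.0455, 0.1364, 0.1818, 0.2727, 0.3636]   # sizes (Table 2)
Z = [200.0, 33.0, 75.0, 33.0, 62.0]            # growth rates in percent
N, n = len(X), 2

Y = sum(x * z for x, z in zip(X, Z))
sigma2 = sum(x * (z - Y) ** 2 for x, z in zip(X, Z))
sum_x2 = sum(x * x * (z - Y) ** 2 for x, z in zip(X, Z))

# (12): approximate variance of the ratio estimator under SRS.
var_ratio = N * (N - n) / (n * (N - 1)) * sum_x2

# (3) with rho from (9): randomized PPS.
gamma = 0.5 + 0.5 * sum(x / (1 - 2 * x) for x in X)
rho9 = -sum(x * x * (z - Y) ** 2 / (1 - 2 * x)
            for x, z in zip(X, Z)) / (gamma * sigma2)
var_pps_9 = (1 + (n - 1) * rho9) * sigma2 / n

# (11): simplified approximation based on (10).
var_pps_11 = sum(x * (1 - (n - 1) * x) * (z - Y) ** 2 for x, z in zip(X, Z)) / n

# Direct estimator N * ybar from SRS.
y = [x * z for x, z in zip(X, Z)]
y_mean = sum(y) / N
S2_y = sum((v - y_mean) ** 2 for v in y) / (N - 1)
var_direct = N * (N - n) / n * S2_y

# Regression (19): ln (Z_i - Y)^2 on ln X_i by ordinary least squares.
lx = [math.log(x) for x in X]
ly = [math.log((z - Y) ** 2) for z in Z]
mx, my = sum(lx) / N, sum(ly) / N
beta_hat = sum((a - mx) * (b - my) for a, b in zip(lx, ly)) \
    / sum((a - mx) ** 2 for a in lx)

print(var_ratio, var_pps_9, var_pps_11, var_direct, beta_hat)
```

The ordering matches the discussion above: the ratio estimator has the smallest approximate variance, the direct estimator comes next, and randomized PPS has the largest, with a fitted $\hat\beta$ near $-3$.
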
5. Summary 


This paper compares the variance of the HT estimator
$\hat Y_{PPS}$ from a randomized PPS sample with the variance of
the classical ratio estimator $\hat Y_R$ from an SRS sample of the
same size. In this comparison the sampling autocorrelation
coefficient $\rho_z$ plays an important role.

When the data pattern of the variables $x$ and $z$ $(= y/x)$
is such that $\rho_z < -1/(N-1)$, it can be shown under mild
conditions that $\hat Y_{PPS}$ is more efficient than $\hat Y_R$ for sufficiently
large $n$ and $N$, provided that $X_i$ and $Z_i$ are uncorrelated.
Under model (15) with $E(\varepsilon_i^2) = \sigma^2 X_i^{\delta}$ it holds that
$\rho_z < -1/(N-1)$ when $\delta > 1$. Hence, for this type of data
$\hat Y_{PPS}$ is to be preferred. Moreover, it emerges from (14) and
(16) that for $\delta = 2$ the relative efficiency of PPS sampling
compared to that of the ratio estimator is increasing when
$CV_x$ is increasing. In addition, $\hat Y_R$ is to be preferred when
the data correspond to a model with $\delta < 1$. These findings
are confirmed empirically with a simulation study using two
different data sets. When model (15) is not applicable, the
relative efficiency of $\hat Y_{PPS}$ is given by (14) provided $n$ is
large and $N$ is relatively larger. In practice the unknown
$\rho_z$ in (14) is replaced by its estimate $\hat\rho_z$. The fact that $n \ll N$ does
not necessarily mean that the factor $(n-1)\rho_z$ in (3) is
always negligible.


Acknowledgements 


The views expressed in the article are those of the 
author and do not necessarily reflect the policy of 
Statistics Netherlands. The author would like to thank 
Peter-Paul de Wolf, Sander Scholtus, the Associate Editor 
and two anonymous referees for their helpful suggestions 
and corrections. 


Appendix A 


A counterexample 


Equations (5) and (7) cannot always be used for
randomized PPS sampling when $n$ and $N$ are of the same
order while $X_i$ and $Z_i$ are correlated. To see that, consider
a population $U$ consisting of two groups $U_1$ and $U_2$ with
means $\bar Y_1$ and $\bar Y_2$, respectively. Both stratum sizes are
$N/2$. Let $s$ be a randomized PPS sample of size $n =
3N/4$ from the whole population $U$. Let the $X_i$ be such
that

$$\pi_i = nX_i = \begin{cases} 1 & \text{if } i \in U_1 \\ 0.5 & \text{if } i \in U_2. \end{cases}$$


Obviously, group 1 does not contribute to the variance. The
selected units in $s$ from $U_2$ constitute an ordinary SRS
sample of size $N/4$. Hence, for randomized PPS sampling
the correct variance formula in this example is

$$\mathrm{var}(\hat Y_{PPS}) = \left(\frac{N}{2}\right)^2 \left(1 - \frac{1}{2}\right)\frac{S_{y2}^2}{N/4} = \frac{N}{2}\,S_{y2}^2,$$

with

$$S_{y2}^2 = \frac{1}{N/2 - 1}\sum_{i \in U_2} (Y_i - \bar Y_2)^2.$$

However, approximation (11) gives an entirely different,
larger outcome unless $\bar Y_1 = 2\bar Y_2$.
The use of estimating equations to
perform a calibration on complex parameters 


Eric Lesage ' 


Abstract 


In the calibration method proposed by Deville and Särndal (1992), the calibration equations take only exact estimates of
auxiliary variable totals into account. This article examines other parameters besides totals for calibration. Parameters that 
are considered complex include the ratio, median or variance of auxiliary variables. 


Key Words: Calibration; Complex parameter; Estimating equation; Calibration weight. 


1. Introduction 


In survey statistics, two main approaches are used in the
estimation phase: “model-assisted” estimators (such as the
regression estimator or the ratio estimator) and calibration
estimators (such as the raking ratio), proposed by Deville and
Särndal (1992). The two approaches are somewhat similar,
as shown by the regression estimator, which is the same as
the calibration estimator with the $\chi^2$ distance (“linear”
calibration method).

The purpose of this article is to expand the family of 
calibration estimators. With the current method, calibration 
can be performed on totals. The idea is to be able to take 
into account the calibration constraints of complex para- 
meters or statistics such as a ratio, a median or a geometric 
mean. The reason for doing this is that auxiliary information 
may consist of a complex statistic rather than totals. For 
example, a ratio relative to the total population might be 
known, but not the total in the numerator or denominator. 

The issue of complex parameters in calibrations has been
discussed in the literature. Särndal (2007) reviewed a
number of them, in particular the work of Harms and
Duchesne (2006) on the calibration estimation of quantiles,
and the work of Krapavickaite and Plikusas (2005) on
calibration estimators of certain functions of totals.

The originality of the approach in this article is that it 
reduces calibration on a complex parameter to calibration on 
a total for a new ad hoc auxiliary variable. The advantage of 
this approach is that current calibration tools can be used 
and that there is no need to solve a complex optimization 
program. 

In section 2 of the article, we review how the calibration
method works, define calibration on complex parameters
and describe simple cases in which calibration on a complex
parameter can be reduced to calibration on a total. In
section 3, we focus on parameters that can be defined as a
solution to an estimating equation (Godambe and
Thompson 1986). We introduce the concept of calibration
on a complex parameter defined by an estimating equation
and show that the resulting calibration equation can be
replaced with an equation for calibration on a total.


2. A complex parameter 
defined as a function of totals 


2.1 Review of calibration on totals 


Let $U$ be a finite population of size $N$. The statistical
units of the population are indexed by a label $k$, where
$k \in \{1, \ldots, N\}$. A sample $s$ is selected using sample plan
$p(s)$. Its size is denoted $n$ and may be random. Let $\pi_k$ be
the probability that $k$ is included in sample $s$, and let
$d_k = 1/\pi_k$ be its sampling weight.

For any variable $z$ that takes the values $z_k$ for the units in
$U$ indexed by $k$, the sum $t_z = \sum_{k \in U} z_k$ is referred to as the
total of $z$ over $U$.

Let $y^{(1)}, \ldots, y^{(Q)}$ be $Q$ variables of interest, whose values
are known only for sample $s$, and let $\theta_y$ be the parameter of
interest that is a function of the totals $t_{y^{(1)}}, \ldots, t_{y^{(Q)}}$:

$$\theta_y = f(t_{y^{(1)}}, \ldots, t_{y^{(Q)}}).$$

The estimator of $\theta_y$ is

$$\hat\theta_{y,\pi} = f(\hat t_{y^{(1)},\pi}, \ldots, \hat t_{y^{(Q)},\pi}).$$

It is simply the function $f(\cdot, \ldots, \cdot)$ with totals $t_{y^{(q)}}$
replaced by their Horvitz-Thompson estimator $\hat t_{y^{(q)},\pi} =
\sum_{k \in s} d_k y_k^{(q)}$ (Särndal, Swensson and Wretman 1992). This
estimator can be described as a substitution estimator.

Let $x^{(1)}, \ldots, x^{(P)}$ be $P$ auxiliary variables known on $s$, and
let $t_{x^{(1)}}, \ldots, t_{x^{(P)}}$ be the totals on $U$ for those auxiliary
variables, also known. For an individual $k$, the vector of
values taken by the auxiliary variables on $k$ is denoted
$\mathbf{x}_k = (x_k^{(1)}, \ldots, x_k^{(P)})'$.
The calibration estimator of $\theta_y$ is

$$\hat\theta_{y,CAL} = f(\hat t_{y^{(1)},CAL}, \ldots, \hat t_{y^{(Q)},CAL})$$

with $\hat t_{y^{(q)},CAL} = \sum_{k \in s} w_k y_k^{(q)}$ and a series of weights
$\{w_k\}_{(k \in s)}$, known as calibration weights (which should be
denoted $w_k(s)$, since they depend on the sampling),
obtained by solving the following optimization program:

$$\min_{\{w_k\}(k \in s)} \sum_{k \in s} d(w_k, d_k)$$

under constraints

$$\hat t_{x^{(1)},CAL} = t_{x^{(1)}}, \quad \ldots, \quad \hat t_{x^{(P)},CAL} = t_{x^{(P)}}.$$

$d(\cdot,\cdot)$ is a pseudo-distance, i.e., a function that measures the
difference between the calibration weight and the sampling
weight (unlike a distance, a pseudo-distance is not necessarily
symmetrical in its two arguments). The program is
solved with a Lagrangian. When the distance used is the
$\chi^2$ distance (i.e., $d(w_k, d_k) = (1/2)(w_k - d_k)^2/d_k$), the
solution is $w_k = d_k(1 + \mathbf{x}_k'\lambda)$ (where $\lambda$ is a $P$-vector of
Lagrange multipliers).
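
For the $\chi^2$ distance the multipliers have a closed form: inserting $w_k = d_k(1 + \mathbf{x}_k'\lambda)$ into the constraints gives $(\sum_s d_k \mathbf{x}_k \mathbf{x}_k')\lambda = t_x - \hat t_{x,\pi}$. The following is a minimal sketch of that computation in plain Python; the function names, the small Gauss-Jordan solver, and the data are illustrative assumptions.

```python
def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    m = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(m):
        piv = max(range(col, m), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(m):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [a - f * c for a, c in zip(M[r], M[col])]
    return [M[i][m] / M[i][i] for i in range(m)]


def linear_calibration(d, x, totals):
    """Chi-square-distance calibration weights w_k = d_k (1 + x_k' lambda)."""
    P = len(totals)
    ht = [sum(dk * xk[p] for dk, xk in zip(d, x)) for p in range(P)]
    T = [[sum(dk * xk[p] * xk[q] for dk, xk in zip(d, x)) for q in range(P)]
         for p in range(P)]
    lam = solve(T, [t - h for t, h in zip(totals, ht)])
    return [dk * (1 + sum(l * v for l, v in zip(lam, xk)))
            for dk, xk in zip(d, x)]


# Hypothetical sample: a constant (to calibrate on N) and one auxiliary variable.
d = [10.0] * 5
x = [(1.0, 2.0), (1.0, 3.5), (1.0, 1.0), (1.0, 4.2), (1.0, 2.8)]
totals = [52.0, 150.0]            # assumed known population totals
w = linear_calibration(d, x, totals)
calibrated = [sum(wk * xk[p] for wk, xk in zip(w, x)) for p in range(2)]
print(w, calibrated)
```

Linear calibration does not guarantee positive weights; other pseudo-distances (for example the raking ratio) are typically used when that matters.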


2.2 Calibration on a complex parameter $\eta_x$


Definition 1: Let $x^{(1)}, \ldots, x^{(P)}$ be $P$ auxiliary variables
known on $s$, and let $\eta_x = g(t_{x^{(1)}}, \ldots, t_{x^{(P)}})$ be a complex
parameter, a function of the totals of those auxiliary
variables, also known.

In the case of calibration on the complex parameter $\eta_x$,
the calibration weights are obtained by solving the
following optimization program:

$$\min_{\{w_k\}(k \in s)} \sum_{k \in s} d(w_k, d_k)$$

under constraints

$$\hat\eta_{x,CAL} = g(\hat t_{x^{(1)},CAL}, \ldots, \hat t_{x^{(P)},CAL}) = \eta_x.$$

The totals $t_{x^{(p)}}$ do not have to be known, but the complex
parameter $\eta_x$ does.

Consider the example of the ratio

$$R_x = \frac{t_{x^{(1)}}}{t_{x^{(2)}}} = \frac{\sum_{k \in U} x_k^{(1)}}{\sum_{k \in U} x_k^{(2)}}.$$

The calibration estimator of $R_x$ is of the form

$$\hat R_{x,CAL} = \frac{\sum_{k \in s} w_k x_k^{(1)}}{\sum_{k \in s} w_k x_k^{(2)}}.$$

The calibration equation in the case of calibration on a
ratio is

$$\hat R_{x,CAL} = R_x.$$

$R_x$ is known auxiliary information, as the total of the
auxiliary variables usually is. This scenario may occur when
we have proportions that are well known and stable over
time, for example, but the specific totals in the numerator
and denominator are not known.

We described the case of calibration on a single complex
parameter, but it is clearly a simple matter to calibrate on
more than one complex parameter. In that case, there are as
many constraints as calibration parameters.


2.3 Simple cases where calibration on a complex 
parameter can be reduced to calibration on a 
total 


It is not easy to determine from the outset whether an 
equation for calibration on a complex parameter can be 
written in the form of an equation for calibration on a total. 
In other words, it is not always a trivial matter to find a 
“new” auxiliary variable z, associated with the complex 
parameter, on whose total we can calibrate. 

For example, that is quite straightforward for all
moments of an auxiliary variable $x$ (it is assumed that under
the sampling plan, the population size $N$ can be estimated
exactly). If $\mu_p = N^{-1}\sum_{k \in U} x_k^p$ is auxiliary information, we
can simply take $z_k = x_k^p/N$ and calibrate on $\mu_p$:
$\sum_{k \in s} w_k z_k = \mu_p$.

If we want to calibrate on the variance and the mean of
variable $x$ with $\mu_x$ and $\sigma_x^2$ as auxiliary information, we can
use the two new auxiliary variables

$$z_k^{(1)} = \frac{x_k}{N}$$

and

$$z_k^{(2)} = \frac{(x_k - \mu_x)^2}{N}.$$

On the other hand, if we do not know $\mu_x$, but we have
$\sigma_x^2$ in the auxiliary information and we want to calibrate on
that variance, things become more complicated. We can see
this if we write the substitution estimator of $\sigma_x^2$ (where the
sampling plan allows the population size $N$ to be estimated
exactly):

$$\hat\sigma_x^2 = \frac{1}{N}\sum_{k \in s} w_k \left\{x_k - \frac{1}{N}\sum_{l \in s} w_l x_l\right\}^2.$$

Finding a new auxiliary variable $z$ is not straightforward,
since the initial calibration equation is not linear relative to
the weight vector. We will return to the variance case in
section 3.3 below.


Ratio example 


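Proposition 1: Calibration on a ratio is equivalent to
calibration on the total of the new auxiliary variable: $z_k =
x_k^{(1)} - R_x x_k^{(2)}$.
The calibration equation is written
$\hat t_{z,CAL} = 0$.


Proof:

$$\begin{aligned}
\hat t_{z,CAL} = t_z
&\Leftrightarrow \sum_{k \in s} w_k (x_k^{(1)} - R_x x_k^{(2)}) = \sum_{k \in U} (x_k^{(1)} - R_x x_k^{(2)}) \\
&\Leftrightarrow \hat t_{x^{(1)},CAL} - R_x \hat t_{x^{(2)},CAL} = t_{x^{(1)}} - R_x t_{x^{(2)}} = 0 \\
&\Leftrightarrow \frac{\hat t_{x^{(1)},CAL}}{\hat t_{x^{(2)},CAL}} = R_x \\
&\Leftrightarrow \hat R_{x,CAL} = R_x.
\end{aligned}$$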
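
Proposition 1 can be tried out on a small example. The sketch below assumes the $\chi^2$ distance, for which a single constraint $\sum_s w_k z_k = 0$ gives $w_k = d_k(1 + \lambda z_k)$ with $\lambda = -\sum_s d_k z_k / \sum_s d_k z_k^2$. The data and the known ratio $R_x = 0.65$ are hypothetical.

```python
def calibrate_to_zero_total(d, z):
    """Chi-square-distance weights satisfying sum_k w_k z_k = 0."""
    lam = -sum(dk * zk for dk, zk in zip(d, z)) \
        / sum(dk * zk * zk for dk, zk in zip(d, z))
    return [dk * (1 + lam * zk) for dk, zk in zip(d, z)]


# Hypothetical sample values of x^(1), x^(2) and sampling weights d_k.
x1 = [12.0, 30.0, 8.0, 22.0, 15.0]
x2 = [20.0, 45.0, 16.0, 30.0, 25.0]
d = [8.0, 12.0, 10.0, 9.0, 11.0]
R_known = 0.65                     # assumed known population ratio

z = [a - R_known * b for a, b in zip(x1, x2)]   # new auxiliary variable
w = calibrate_to_zero_total(d, z)

ratio_ht = sum(dk * a for dk, a in zip(d, x1)) / sum(dk * b for dk, b in zip(d, x2))
ratio_cal = sum(wk * a for wk, a in zip(w, x1)) / sum(wk * b for wk, b in zip(w, x2))
print(ratio_ht, ratio_cal)
```

The calibrated ratio reproduces $R_x$ exactly even though neither total is known, which is the point of the proposition.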


Function of a ratio of linear combinations of totals 


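Let $\eta_x$ be a complex parameter that is a bijective
function of a ratio of linear combinations of totals:

$$\eta_x = h\!\left(\frac{\alpha' \cdot \mathbf{t}_x}{\beta' \cdot \mathbf{t}_x}\right) \qquad (1)$$

with $\alpha' = (\alpha_1, \ldots, \alpha_P)$ and $\beta' = (\beta_1, \ldots, \beta_P)$ being vectors of
real coefficients of size $P$, and $\mathbf{t}_x' = (t_{x^{(1)}}, \ldots, t_{x^{(P)}})$.


Proposition 2: Performing a calibration on complex
parameter $\eta_x$ defined by function (1) is equivalent to
calibrating on the total of the new auxiliary variable:

$$z_k = (\alpha' - h^{-1}(\eta_x)\beta') \cdot \mathbf{x}_k$$

with calibration equation $\hat t_{z,CAL} = 0$.


Proof:

$$\begin{aligned}
\hat\eta_{x,CAL} = \eta_x
&\Leftrightarrow h\!\left(\frac{\alpha' \cdot \hat{\mathbf{t}}_{x,CAL}}{\beta' \cdot \hat{\mathbf{t}}_{x,CAL}}\right) = \eta_x \\
&\Leftrightarrow \frac{\alpha' \cdot \hat{\mathbf{t}}_{x,CAL}}{\beta' \cdot \hat{\mathbf{t}}_{x,CAL}} = h^{-1}(\eta_x) \\
&\Leftrightarrow (\alpha' - h^{-1}(\eta_x)\beta') \cdot \hat{\mathbf{t}}_{x,CAL} = 0 \\
&\Leftrightarrow \sum_{k \in s} w_k (\alpha' - h^{-1}(\eta_x)\beta') \cdot \mathbf{x}_k = 0.
\end{aligned}$$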
Consider the example of the geometric mean:

$$\mu_{Geo,x} = \Big(\prod_{k \in U} x_k\Big)^{1/N}.$$

This expression can be rewritten as

$$\mu_{Geo,x} = \exp\!\left(\frac{\sum_{k \in U} \ln(x_k)}{\sum_{k \in U} 1}\right).$$

We denote $\mathbf{x}_k' = (x_k^{(1)}, x_k^{(2)}) = (\ln(x_k), 1)$, $\alpha' = (1, 0)$, $\beta' =
(0, 1)$ and $h^{-1}(u) = \exp^{-1}(u) = \ln(u)$.
Hence, the new auxiliary variable is

$$z_k = \ln(x_k) - \ln(\mu_{Geo,x}) \cdot 1.$$


We will see later in the article that the estimating 
equations method provides another approach to displaying 
the new auxiliary variable(s) z. 
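
The geometric-mean case can be sketched in the same way. The example below assumes the $\chi^2$ distance and a single constraint on the total of $z_k = \ln(x_k) - \ln(\mu_{Geo,x})$; the sample values and the known geometric mean of 3.5 are hypothetical.

```python
import math


def calibrate_to_zero_total(d, z):
    """Chi-square-distance weights satisfying sum_k w_k z_k = 0."""
    lam = -sum(dk * zk for dk, zk in zip(d, z)) \
        / sum(dk * zk * zk for dk, zk in zip(d, z))
    return [dk * (1 + lam * zk) for dk, zk in zip(d, z)]


def weighted_geo_mean(weights, x):
    """exp( sum w_k ln x_k / sum w_k ): substitution estimator of the geometric mean."""
    return math.exp(sum(wk * math.log(xk) for wk, xk in zip(weights, x))
                    / sum(weights))


x = [2.0, 8.0, 4.0, 1.0, 16.0]     # hypothetical sample values
d = [5.0] * 5                       # hypothetical sampling weights
mu_geo_known = 3.5                  # assumed known population geometric mean

z = [math.log(xk) - math.log(mu_geo_known) for xk in x]
w = calibrate_to_zero_total(d, z)

print(weighted_geo_mean(d, x), weighted_geo_mean(w, x))
```

With the sampling weights the estimate is 4; after calibration it equals the known value 3.5.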


3. Parameter defined by an estimating equation 


3.1 Estimating with an estimating equation 


Certain parameters $\theta_y$ are defined, or can be defined, as
the solution to an implicit function known as the estimating
equation on $U$ (Godambe and Thompson 1986), i.e.:

$$\sum_{k \in U} \Phi(\theta_y, \mathbf{y}_k) = 0$$

with $\mathbf{y}_k' = (y_k^{(1)}, \ldots, y_k^{(Q)})$ being the vector of values taken
by the variables of interest for individual $k$.

In this context, an estimator of $\theta_y$ is defined for sample
$s$, denoted $\hat\theta_{y,ee}$, which is the solution of the estimating
equation on $s$ (see in particular Hidiroglou, Rao and Yung
2002):

$$\sum_{k \in s} d_k \Phi(\hat\theta_{y,ee}, \mathbf{y}_k) = 0.$$


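Table 1
Examples of parameters defined by estimating equations on U

Parameter                               $\Phi(\theta_y, \mathbf{y}_k)$            Estimating equation on U
mean $\mu$                              $(y_k - \mu)$                             $\sum_{k \in U} (y_k - \mu) = 0$
ratio $R = t_{y^{(1)}}/t_{y^{(2)}}$     $(y_k^{(1)} - R\,y_k^{(2)})$              $\sum_{k \in U} (y_k^{(1)} - R\,y_k^{(2)}) = 0$
median $m$                              $(\mathbf{1}_{[y_k \le m]} - 1/2)$        $\sum_{k \in U} (\mathbf{1}_{[y_k \le m]} - 1/2) = 0$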


Consider also the example of the coefficient of a logistic
regression. Let $y^{(1)}$ be a dichotomous variable that takes the
values 0 and 1 on $U$, and let $y^{(2)}$ be a quantitative variable.
The value $y_k^{(1)}$ taken by $y^{(1)}$ for unit $k$ is assumed to be an
instance of the random variable $Y_k^{(1)}$, which has a Bernoulli
distribution with parameter

$$p_k = \frac{1}{1 + \exp(-\beta_0\, y_k^{(2)})}.$$

We have limited the number of parameters to one, but it
would be just as simple to consider the multidimensional
case. However, we should provide a definition of the
estimating equations that take the case of the vector parameters
into account.

The parameter of interest to us is the estimator of $\beta_0$,
denoted $B$, calculated on the finite population by the
maximum likelihood method. The estimating equation of $B$
on $U$ will be the maximum likelihood equation. With
$p_k(\beta) = 1/\{1 + \exp(-\beta\, y_k^{(2)})\}$, the log-likelihood
in the case of Bernoulli variables is

$$L(\beta) = \sum_{k \in U} y_k^{(1)} \ln p_k(\beta) + \sum_{k \in U} (1 - y_k^{(1)}) \ln(1 - p_k(\beta)).$$

It is easy to derive the estimating equation of $B$ on $U$:

$$\sum_{k \in U} y_k^{(2)}\left\{y_k^{(1)} - \frac{1}{1 + \exp(-B\, y_k^{(2)})}\right\} = 0.$$

The estimating equation on $s$ which defines the estimator
$\hat B_{ee}$ on the basis of the sampling weights is

$$\sum_{k \in s} d_k\, y_k^{(2)}\left\{y_k^{(1)} - \frac{1}{1 + \exp(-\hat B_{ee}\, y_k^{(2)})}\right\} = 0.$$

The estimating equation is not linear in the parameter, so
$\hat B_{ee}$ cannot be expressed as a simple function of the
observations.

The logistic regression example is very interesting 
because it shows that we do not need to know (pes to 
perform the calibration. We will see in the next subsection 
that we only need to know the generic term of the esti- 
mating equation on 


$U$, $\Phi(B, \mathbf{y}_k) = y_k^{(2)}\left\{y_k^{(1)} - \dfrac{1}{1 + \exp(-B\, y_k^{(2)})}\right\}$,
for all $k \in s$.


3.2 Calibration in the case of parameters defined by 
estimating equations 


Let $\mathbf{x}_k' = (x_k^{(1)}, \ldots, x_k^{(P)})$ be the vector of $P$ known auxiliary
variables on $s$, and let $\eta_x$ be a complex parameter, also
known, defined by the estimating equation

$$\sum_{k \in U} \Psi(\eta_x, \mathbf{x}_k) = 0.$$

Definition 2: In the case of calibration on the complex
parameter $\eta_x$, the calibration weights are obtained by
solving the following optimization program:

$$\min_{\{w_k\}(k \in s)} \sum_{k \in s} d(w_k, d_k)$$

under constraints

$$\sum_{k \in s} w_k \Psi(\eta_x, \mathbf{x}_k) = 0.$$

Proposition 3: Calibration on a complex parameter $\eta_x$,
defined by an estimating equation, is equivalent to a
calibration on the total of the new auxiliary variable: $z_k =
\Psi(\eta_x, \mathbf{x}_k)$, with the calibration constraint $\sum_{k \in s} w_k z_k = 0$.


Definition 3: A calibration estimator of the parameter of
interest $\theta_y$, denoted $\hat\theta_{y,ee,CAL}$, is a solution to the
estimating equation on $s$ weighted by the calibration
weights $\{w_k\}_{(k \in s)}$:

$$\sum_{k \in s} w_k \Phi(\hat\theta_{y,ee,CAL}, \mathbf{y}_k) = 0.$$

In most cases, the solution to the estimating equation is
unique. The median is an example of a parameter for which
there may be more than one solution. In this case, the
infimum is often used as an estimator.


Proposition 4: If there is only one solution to the equation
$\sum_{k \in s} w_k \Psi(\hat\eta_{x,ee,CAL}, \mathbf{x}_k) = 0$, then

$$\hat\eta_{x,ee,CAL} = \eta_x.$$

Proof: $\eta_x$ is a solution to the estimating equation that
defines $\hat\eta_{x,ee,CAL}$. Since there is a unique solution, we have
$\hat\eta_{x,ee,CAL} = \eta_x$.


3.3. Calibration on a variance 


In this section, we examine calibration on variance o7, 
which is a more complicated complex parameter than those 
discussed above. We will show that when the variance is the 
only auxiliary information we have, we can perform an 
approximate calibration that produces calibration weights 
that have better properties than the sampling weights. 

Back to the variance case. The mean p1, and the variance 
o, on U of auxiliary variable x can be defined by two 
estimating equations on U: 


O, =) =0 (2) 
keU 
DGS wes, =0: (3) 
keU 


If we know the two parameters, calibrating on them is 
easy, since we merely have to calibrate on the totals of the 
two new auxiliary variables z= x — p, and z” = 
(x = wy = ce 

On the other hand, if we consider the textbook case
where the mean $\mu_x$ is not known, the parameter $\sigma_x^2$ cannot
be defined by a unique estimating equation. If we replace
$\mu_x$ with its explicit definition

$$\mu_x = \frac{1}{N} \sum_{j \in U} x_j$$

in equation (3), we obtain the equation

$$\sum_{k \in U} \left[\left(x_k - \frac{1}{N} \sum_{j \in U} x_j\right)^2 - \sigma_x^2\right] = 0,$$

which cannot be written in the form of an estimating
equation $\sum_{k \in U} \Psi(\sigma_x^2, x_k) = 0$.

$\mu_x$ thus becomes a nuisance parameter (Binder 1991).
To overcome this difficulty, we can replace it in equation
(3) with its substitution estimator $\hat{\mu}_{x,\pi} = \hat{t}_{x,\pi}/\hat{N}_{\pi}$, with
$\hat{N}_{\pi} = \sum_{k \in s} d_k 1$ being the Horvitz-Thompson estimator of
the size of population $U$. This leads to the “approximate”
calibration equation

$$\sum_{k \in s} w_k \left[\left(x_k - \frac{\hat{t}_{x,\pi}}{\hat{N}_{\pi}}\right)^2 - \sigma_x^2\right] = 0. \tag{4}$$


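Proposition 5: With estimating equation (4), calibration on
the variance is not perfect, and we have

$$\hat{\sigma}^2_{x,\mathrm{CAL}} = \sigma_x^2 - \left(\frac{\hat{t}_{x,\pi}}{\hat{N}_{\pi}} - \frac{\hat{t}_{x,\mathrm{CAL}}}{\hat{N}_{\mathrm{CAL}}}\right)^2. \tag{5}$$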


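Proof.

- The “approximate” calibration equation is equation (4).

- The definition of the parameters’ calibration estimators:

$$\sum_{k \in s} w_k \left(x_k - \hat{\mu}_{x,\mathrm{CAL}}\right) = 0$$

$$\sum_{k \in s} w_k \left[\left(x_k - \hat{\mu}_{x,\mathrm{CAL}}\right)^2 - \hat{\sigma}^2_{x,\mathrm{CAL}}\right] = 0$$

This can be rewritten

$$\hat{\mu}_{x,\mathrm{CAL}} = \frac{\sum_{k \in s} w_k x_k}{\sum_{k \in s} w_k} = \frac{\hat{t}_{x,\mathrm{CAL}}}{\hat{N}_{\mathrm{CAL}}}$$

$$\sum_{k \in s} w_k \left[\left(x_k - \frac{\hat{t}_{x,\mathrm{CAL}}}{\hat{N}_{\mathrm{CAL}}}\right)^2 - \hat{\sigma}^2_{x,\mathrm{CAL}}\right] = 0$$

- If we subtract the second estimating equation from the
approximate calibration equation, we get

$$\sum_{k \in s} w_k \left[\left(x_k - \frac{\hat{t}_{x,\pi}}{\hat{N}_{\pi}}\right)^2 - \left(x_k - \frac{\hat{t}_{x,\mathrm{CAL}}}{\hat{N}_{\mathrm{CAL}}}\right)^2\right] - \hat{N}_{\mathrm{CAL}}\left(\sigma_x^2 - \hat{\sigma}^2_{x,\mathrm{CAL}}\right) = 0.$$

Using the identity $a^2 - b^2 = (a - b)(a + b)$, we have

$$\sum_{k \in s} w_k \left(\frac{\hat{t}_{x,\mathrm{CAL}}}{\hat{N}_{\mathrm{CAL}}} - \frac{\hat{t}_{x,\pi}}{\hat{N}_{\pi}}\right)\left(2x_k - \frac{\hat{t}_{x,\pi}}{\hat{N}_{\pi}} - \frac{\hat{t}_{x,\mathrm{CAL}}}{\hat{N}_{\mathrm{CAL}}}\right) - \hat{N}_{\mathrm{CAL}}\left(\sigma_x^2 - \hat{\sigma}^2_{x,\mathrm{CAL}}\right) = 0$$

$$\left(\frac{\hat{t}_{x,\mathrm{CAL}}}{\hat{N}_{\mathrm{CAL}}} - \frac{\hat{t}_{x,\pi}}{\hat{N}_{\pi}}\right) \sum_{k \in s} w_k \left(2x_k - \frac{\hat{t}_{x,\pi}}{\hat{N}_{\pi}} - \frac{\hat{t}_{x,\mathrm{CAL}}}{\hat{N}_{\mathrm{CAL}}}\right) - \hat{N}_{\mathrm{CAL}}\left(\sigma_x^2 - \hat{\sigma}^2_{x,\mathrm{CAL}}\right) = 0$$

$$\hat{N}_{\mathrm{CAL}} \left(\frac{\hat{t}_{x,\mathrm{CAL}}}{\hat{N}_{\mathrm{CAL}}} - \frac{\hat{t}_{x,\pi}}{\hat{N}_{\pi}}\right)^2 - \hat{N}_{\mathrm{CAL}}\left(\sigma_x^2 - \hat{\sigma}^2_{x,\mathrm{CAL}}\right) = 0$$

This is the same as the expression for $\hat{\sigma}^2_{x,\mathrm{CAL}}$ in
equation (5).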
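
The identity in equation (5) can be checked numerically. The sketch below is only an illustration: it assumes linear calibration weights $w_k = d_k(1 + \lambda z_k)$ on the single auxiliary variable of equation (4), and the sample values, design weights and "known" variance are made up.

```python
import random

random.seed(2)
x = [random.gauss(5.0, 3.0) for _ in range(40)]   # sampled x values (hypothetical)
d = [10.0 + (k % 5) for k in range(len(x))]       # unequal design weights (hypothetical)
sigma2 = 9.0                                      # variance on U, assumed known

# Substitution (Horvitz-Thompson type) estimator of the mean.
n_pi = sum(d)
mu_pi = sum(dk * xk for dk, xk in zip(d, x)) / n_pi

# Auxiliary variable of the approximate calibration equation (4).
z = [(xk - mu_pi) ** 2 - sigma2 for xk in x]

# Linear calibration on the single constraint sum(w * z) = 0.
lam = -sum(dk * zk for dk, zk in zip(d, z)) / sum(dk * zk * zk for dk, zk in zip(d, z))
w = [dk * (1 + lam * zk) for dk, zk in zip(d, z)]

# Calibration estimators of the mean and the variance.
n_cal = sum(w)
mu_cal = sum(wk * xk for wk, xk in zip(w, x)) / n_cal
sigma2_cal = sum(wk * (xk - mu_cal) ** 2 for wk, xk in zip(w, x)) / n_cal

# Right-hand side of equation (5).
rhs = sigma2 - (mu_pi - mu_cal) ** 2
print(sigma2_cal, rhs)
```

The equality holds for any weights satisfying equation (4), so the choice of the linear method does not matter for this check.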


This result is interesting because, without an exact
calibration, we have a calibration estimator of $\sigma_x^2$ that is
asymptotically more precise than the substitution estimator
$\hat{\sigma}^2_{x,\pi}$. That is, if we resort to the asymptotic framework
typically used in surveys and employ linearization of
complex estimators (Deville 1999), we have


and 


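$$\left(\hat{\sigma}^2_{x,\mathrm{CAL}} - \sigma_x^2\right) = -\left(\frac{\hat{t}_{x,\pi}}{\hat{N}_{\pi}} - \frac{\hat{t}_{x,\mathrm{CAL}}}{\hat{N}_{\mathrm{CAL}}}\right)^2 = O_p\!\left(\frac{1}{n}\right).$$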


This yields 


4. Conclusion 


In this article, we presented a simple method of 
performing a calibration in cases where the auxiliary 
information takes the form of a complex parameter. That 
method is based on the concept of the estimating equation. 
Its major advantage is that it can be used with current 
calibration software. 
In future research, it would be interesting to determine
the practical cases in which the use of complex parameters 
in the calibration improves the precision of the parameters 
of interest. 
Waksberg Invited Paper Series


The journal Survey Methodology has established an annual invited paper series in honour of Joseph 
Waksberg, who has made many important contributions to survey methodology. Each year a prominent 
survey researcher is chosen to author an article as part of the Waksberg Invited Paper Series. The paper 
reviews the development and current state of a significant topic within the field of survey methodology, and 
reflects the mixture of theory and practice that characterized Waksberg’s work. 


Please see the announcements at the end of the Journal for information about the nomination and 
selection process of the 2012 Waksberg Award. 


This issue of Survey Methodology opens with the tenth paper of the Waksberg Invited Paper Series. The 
editorial board would like to thank the members of the selection committee Daniel Kasprzyk (Chair), 
Elisabeth A. Martin, Mary E. Thompson and Wayne Fuller for having selected Danny Pfeffermann as the 
author of this year’s Waksberg Award paper. 


2011 Waksberg Invited Paper 
Author: Danny Pfeffermann 


Danny Pfeffermann is Professor of statistics at the Hebrew University of Jerusalem, Israel, and at 
Southampton Statistical Sciences Research Institute (S3RI), University of Southampton, UK. For the 
past 15 years he has also been a consultant for the US Bureau of Labor Statistics. His main research areas are
analytic inference from complex sample surveys, seasonal adjustment and trend estimation, small area 
estimation, and more recently, observational studies and nonresponse. Danny served for two years as 
the president of the Israel Statistical Association and is the president elect of the International 
Association of Survey Statisticians (IASS). He is co-editor of the new two-volume handbook in
Statistics on “Sample Surveys”. 



Preface from the author 


It is a great honour to receive the award named after Joe Waksberg. I am old enough to have had the 
fortune of meeting Joe on several occasions, the last time being a whole day of professional meetings at 
Westat, discussing nothing else but my own modest contributions to survey sampling. What I remember 
from these meetings is Joe’s brilliance, profound knowledge and sharp intellect, even at his very advanced 
age. I would be lying if I said that I was able to answer all his critical questions.


I feel even more honoured and privileged when I look at the list of all the eminent survey statisticians 
who received the award before me. While I am still trying to convince myself that I deserve being on that 
list, I am overwhelmed by all the sincere congratulations and good words from colleagues around the world
and during the symposium. What can I say, I am very proud and grateful. 


On this occasion, I would like to commemorate also one of the founders and the long serving editor of 
Survey Methodology, the late M.P. Singh. In 1993 I published a paper in the International Statistical Review
entitled “The role of sampling weights when modeling survey data”. This paper was well received and when 
I met M.P. a couple of years later, he sort of complained to me for not publishing the paper in Survey 
Methodology. Not having a convincing answer, I promised M.P. that one day I would write another paper on 
this topic and submit it to Survey Methodology. I feel that with the present paper I have kept my promise to 
M.P. Singh. 


Danny Pfeffermann 
Modelling of complex survey data:
Why model? Why is it a problem? How can we approach it? 


Danny Pfeffermann ¹


Abstract 


This article attempts to answer the three questions appearing in the title. It starts by discussing unique features of complex 
survey data not shared by other data sets, which require special attention but suggest a large variety of diverse inference 
procedures. Next a large number of different approaches proposed in the literature for handling these features are reviewed 
with discussion on their merits and limitations. The approaches differ in the conditions underlying their use, additional data 
required for their application, goodness of fit testing, the inference objectives that they accommodate, statistical efficiency, 
computational demands, and the skills required from analysts fitting the model. The last part of the paper presents 
simulation results, which compare the approaches when estimating linear regression coefficients from a stratified sample in 
terms of bias, variance, and coverage rates. It concludes with a short discussion of pending issues. 


Key Words: Informative 
Randomization distribution; Sample model. 


1. Introduction 


Survey data are frequently used for analytic inference on 
statistical models, which are assumed to hold for the 
population from which the sample is taken. Familiar exam- 
ples include the estimation of income elasticities from 
household surveys, the analysis of labour market dynamics 
from labour force surveys, comparisons of pupils’ achieve- 
ments from educational surveys and the search for causal 
relationships between risk factors and disease prevalence 
from health surveys. An important common feature to all 
these examples is that interest lies in the structure of the 
models being estimated and what can be learnt from them. 
This is different from fitting models merely for prediction 
purposes, such as when predicting finite population totals or 
in small area estimation, where the structure and interpre- 
tation of the model are of secondary importance. Models are 
also used implicitly for choosing the sampling design and 
estimators, such as in stratified sampling, or when defining 
weighting cells for nonresponse adjustments. However, in- 
ference is typically based in these cases on the randomization 
distribution over all possible sample selections, and not on the 
model, which is known as ‘model assisted inference’. 

Survey data typically differ from other data sets in five 
main aspects. 

1. The samples are selected at random with known 
selection probabilities, which allows using the ran- 
domization distribution over all possible sample 
selections as the basis for inference instead of the 
hypothetical distribution underlying the population 
model. As discussed below, a combination of the 
two distributions is in common use. 


1. Danny Pfeffermann, Southampton Statistical Sciences Research 
d.pfeffermann@soton.ac.uk. 
2. The sample selection probabilities in at least some
stages of the sample selection are often unequal; 
when these probabilities are related to the model 
outcome variable, the sampling process becomes 
informative and the model holding for the sample is 
then different from the target population model. 

3. Survey data are almost inevitably subject to various 
forms of nonresponse, often of considerable magni- 
tude, which again may distort the population model 
if the response propensity is associated with the 
outcome of interest (not missing at random non- 
response). 

4. The sample data are often clustered due to the use of 
multi-stage cluster samples. The clusters are ‘natural 
units’ (households, individuals in case of longitu- 
dinal surveys...), implying that observations within 
the same cluster are correlated. 

5. The data available to the modeler may be masked 
(“swapped”, “contaminated”, “suppressed”) in order
to protect the anonymity of the respondents. When 
this is the case, the modeler’s data differ from the 
correct data. 


Many approaches have been proposed in the literature for 
estimating population models from complex survey data 
possessing these features, some of which are more familiar 
than the others. The approaches differ in the conditions 
underlying their use, the data required for their application, 
goodness of fit testing, the inference objectives that they 
accommodate, statistical efficiency, computational demands, 
and the skills required from analysts fitting the model. This
heterogeneity means that there does not exist any single
approach that can be considered as best in all situations.
That being the case, a fundamental question arising is which 
approach or approaches could or should be used for a given 
practical application. 

The present paper is divided into three parts. In the first 
part (Section 2) I elaborate on the first four features of 
complex survey data mentioned above. In the second part 
(Section 3) I review the various approaches proposed in the 
literature for dealing with these features, discussing their 
merits and limitations in light of the properties mentioned 
above. In the third part (Section 4) I present simulation 
results which compare the approaches when estimating a 
linear regression model from a stratified sample in terms of 
bias, variance, and coverage rates. I conclude with a short 
discussion of pending issues in Section 5. 


2. Why are survey data different 
from other data? 


2.1 The problem of unequal sampling probabilities 
and nonresponse 

Consider a finite population $U = \{1, \ldots, N\}$ with
measurements $\{y_i, x_i, z_i\}$ for unit $i = 1, \ldots, N$, where $y$
represents an outcome variable of interest, $x$ a vector of
covariates and $z$ a vector of design variables used for the
sample selection. The design variables may include some or
all of the covariates, and in special cases also the outcome
variable when known for all the population units, such as in
case-control studies. The matrix $Z_U = [z_1, \ldots, z_N]$ is known
to the sampler drawing the sample, but not necessarily to the
analyst fitting the model. Denote by $s = (I_1, \ldots, I_N)$ the
selected sample, where $I_i$ is the sampling indicator taking
the value 1 if unit $i \in U$ is drawn to the sample and 0
otherwise. In practice, not all the sampled units necessarily
respond, and we denote by $R_i$ the response indicator;
$R_i = 1\,(0)$ if unit $i \in s$ responds (does not respond).

The observed data may be viewed as the outcome of 
three random processes. The first process generates the
vectors $\{y_i, x_i, z_i\}$ for the $N$ population units. The second
process selects a sample $s$ from $U$ at random by a sampling
design, $\Pr(s) = \Pr(s \mid Z_U)$. The third process selects
the responding units. This process is obviously not part of 
the original sampling design and is often the result of ‘self 
selection’, although nonresponse could be caused by many 
other reasons. See Brick and Montaquila (2009) for a recent 
overview. 

When the sample selection probabilities and/or the response
probabilities are related to the values of the outcome
variable even after conditioning on the model covariates, in
the sense that $\Pr(I_i = 1 \mid y_i, x_i) \neq \Pr(I_i = 1 \mid x_i)$ or
$\Pr(R_i = 1 \mid y_i, x_i, I_i = 1) \neq \Pr(R_i = 1 \mid x_i, I_i = 1)$, the model holding
for the observed outcomes is different from the population
model. In symbols, $f_o(y_i \mid x_i) \neq f_p(y_i \mid x_i)$, where $f_o(y_i \mid x_i)$
represents the model holding for a unit selected to the sample
and responding, and $f_p(y_i \mid x_i)$ is the population model (the
model holding for the population values). See Equations
(2.1) and (2.2) below.


Example 1. Suppose that the population model is the
regression model, $f_p(y_i \mid x_i) = N(x_i'\beta, \sigma^2)$, and that the
sample is selected with selection probabilities satisfying
$\Pr(I_i = 1 \mid y_i, x_i) = \exp[\gamma_1 y_i + \gamma_2 y_i^2 + g(x_i)]$, where $\gamma_1$
and $\gamma_2 < 0$ are constants and $g(x_i)$ is some nonstochastic
function of the covariates. Simple use of Bayes theorem (see
below) shows that the model holding for the sample
outcomes is in this case $f_s(y_i \mid x_i) = N[(\gamma_1\sigma^2 + x_i'\beta)/C, \sigma^2/C]$,
where $C = (1 - 2\sigma^2\gamma_2)$. Thus, although the sample
residuals have again a normal distribution, the regression
coefficients and the residual variance are different from their
values under the population model. In the special case
$\gamma_2 = 0$, the slope coefficients and the residual variance are
the same as under the population model, but not the intercept.
If $\gamma_1 = 0$ as well, the sample selection probabilities
satisfy $\Pr(I_i = 1 \mid y_i, x_i) = \Pr(I_i = 1 \mid x_i)$ and the two
models are now the same.
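
The sample model of Example 1 can be checked numerically. The following sketch, with arbitrary illustrative values for $x_i'\beta$, $\sigma^2$, $\gamma_1$ and $\gamma_2$, integrates the density proportional to $\exp(\gamma_1 y + \gamma_2 y^2) f_p(y \mid x)$ on a grid and compares its mean and variance with the closed forms; the term $g(x_i)$ cancels in the normalization and is therefore omitted. The function name and the grid settings are assumptions of the example.

```python
import math


def sample_moments(mean_p, sigma2, gamma1, gamma2, half_width=12.0, steps=20001):
    """Mean and variance of the density proportional to
    exp(gamma1*y + gamma2*y^2) * N(y; mean_p, sigma2), via a Riemann sum."""
    sd = math.sqrt(sigma2)
    lo = mean_p - half_width * sd
    h = 2.0 * half_width * sd / (steps - 1)
    s0 = s1 = s2 = 0.0
    for k in range(steps):
        y = lo + k * h
        f = math.exp(gamma1 * y + gamma2 * y * y - (y - mean_p) ** 2 / (2.0 * sigma2))
        s0 += f
        s1 += f * y
        s2 += f * y * y
    mean_s = s1 / s0
    return mean_s, s2 / s0 - mean_s ** 2


# Illustrative values: x'beta = 2, sigma^2 = 2, gamma1 = 0.3, gamma2 = -0.1.
mean_p, sigma2, gamma1, gamma2 = 2.0, 2.0, 0.3, -0.1
mean_s, var_s = sample_moments(mean_p, sigma2, gamma1, gamma2)

C = 1.0 - 2.0 * sigma2 * gamma2
print(mean_s, (gamma1 * sigma2 + mean_p) / C)   # sample mean vs closed form
print(var_s, sigma2 / C)                        # sample variance vs closed form
```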

Following conventional terminology, when
$\Pr(I_i = 1 \mid y_i, x_i) \neq \Pr(I_i = 1 \mid x_i)$ the sampling design is said to be
informative. When $\Pr(R_i = 1 \mid y_i, x_i, I_i = 1) \neq \Pr(R_i = 1 \mid x_i, I_i = 1)$,
the nonresponse is not missing at random (NMAR
nonresponse). Notice that whereas the sampling probabilities
are typically known to the analyst fitting the model,
at least for the sampled units, the response probabilities are
generally unknown and need to be modelled under NMAR
nonresponse. Ignoring an informative sample or NMAR
nonresponse and thus assuming implicitly that the model 
holding for the observed outcomes is the same as the target 
population model may yield large biases and erroneous 
inference. The books edited by Kasprzyk, Duncan, Kalton 
and Singh (1989), Skinner, Holt and Smith (1989) and 
Chambers and Skinner (2003) contain many discussions and 
illustrations of the effect of ignoring informative sampling 
or NMAR nonresponse. See also Pfeffermann (1993, 1996), 
Pfeffermann and Sverchkov (2009) and Pfeffermann and 
Sikov (2011) for further discussions and examples, with 
many other more recent references. 

In what follows, I use the abbreviation “pdf” to define 
the probability density function when the outcome is 
continuous or the probability function when the outcome is 
discrete. Suppose first that there is no nonresponse. Following
Pfeffermann, Krieger and Rinott (1998a), the marginal
sample pdf, $f_s(y_i \mid x_i)$, defines the conditional pdf of $y_i$
given that unit $i$ is in the sample ($I_i = 1$). By Bayes
theorem,

$$f_s(y_i \mid x_i) = f(y_i \mid x_i, I_i = 1) = \frac{\Pr(I_i = 1 \mid x_i, y_i)\, f_p(y_i \mid x_i)}{\Pr(I_i = 1 \mid x_i)}, \tag{2.1}$$

where $f_p(y_i \mid x_i)$ is the corresponding population pdf. The
probabilities $\Pr(I_i = 1 \mid x_i, y_i)$ are generally not the same
as the sample selection probabilities $\pi_i = \Pr(I_i = 1)$,
which may depend on all the population values $Z_U$ of the
design variables. However, the use of the marginal sample
pdf only requires modelling $\Pr(I_i = 1 \mid x_i, y_i)$. Typically,
$\Pr(I_i = 1 \mid \pi_i, y_i, x_i) = \pi_i$, in which case
$\Pr(I_i = 1 \mid y_i, x_i) = E_p(\pi_i \mid y_i, x_i)$, where $E_p(\cdot)$ is the expectation under
the population pdf.


Remark 1. In practice, the covariates featuring in the
population model need not be the same as the covariates
featuring in the model of the conditional sample inclusion
probabilities, $\Pr(I_i = 1 \mid x_i, y_i)$. In fact, following the
results in Pfeffermann and Landsman (2011), identifiability
of the sample model often requires that the two sets of
covariates are not identical. However, to simplify the
presentation in this paper, I assume for convenience that the
covariates contained in the population model and the
covariates defining the conditional inclusion probabilities
are the same, or alternatively, that $x_i$ defines the union of
the two sets of covariates.


It follows from (2.1) that unless $\Pr(I_i = 1 \mid x_i, y_i) =
\Pr(I_i = 1 \mid x_i)\ \forall y_i$, the sample pdf is different from the
population pdf, in which case the sampling design is
informative and cannot be ignored in the inference process.
In particular, it follows from (2.1) that under informative
sampling,

$$E_s(y_i \mid x_i) = E_p\!\left[\frac{\Pr(I_i = 1 \mid x_i, y_i)\, y_i}{\Pr(I_i = 1 \mid x_i)} \,\Big|\, x_i\right] \neq E_p(y_i \mid x_i),$$

where $E_s(\cdot)$ is the expectation under the sample pdf.
Estimating $E_p(y_i \mid x_i)$ is often the main target of inference,
illustrating that ignoring an informative sampling
scheme and thus estimating implicitly $E_s(y_i \mid x_i)$ can bias
the inference.

Suppose now the existence of NMAR nonresponse. The 
marginal sample pdf (2.1) can be extended to this case by 
defining, 


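$$\begin{aligned}
f_o(y_i \mid x_i) &= f(y_i \mid x_i, I_i = 1, R_i = 1) \\
&= \frac{\Pr(R_i = 1 \mid y_i, x_i, I_i = 1)\Pr(I_i = 1 \mid y_i, x_i)\, f_p(y_i \mid x_i)}{\Pr(R_i = 1 \mid x_i, I_i = 1)\Pr(I_i = 1 \mid x_i)} \\
&= \frac{\Pr(R_i = 1 \mid y_i, x_i, I_i = 1)\, f_s(y_i \mid x_i)}{\Pr(R_i = 1 \mid x_i, I_i = 1)}.
\end{aligned} \tag{2.2}$$

Notice from (2.2) that unless $\Pr(R_i = 1 \mid y_i, x_i, I_i = 1) =
\Pr(R_i = 1 \mid x_i, I_i = 1)\ \forall y_i$, the pdf holding for the observed
outcomes is different from the sample pdf. Here again I
assume for convenience that the response probabilities
depend on the same covariates as in the sample model. See
Remark 1 above.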

The pdfs (2.1) and (2.2) define the marginal distributions 
of the outcome for a given unit. These definitions generalize 
very naturally to the joint pdf of two or more outcomes 
associated with different units. More generally, define for
every plausible sample $s \subset U$ the sample indicator $A_s$
such that $A_s = 1$ if $s$ is sampled and $A_s = 0$ otherwise,
and assume for convenience full response. Denote the data
associated with $s$ by $(y_s, x_s)$. The joint sample pdf of
$y_s \mid x_s$ is then,

$$f_s(y_s \mid x_s) = f(y_s \mid x_s, A_s = 1) = \frac{\Pr(A_s = 1 \mid y_s, x_s)\, f_p(y_s \mid x_s)}{\Pr(A_s = 1 \mid x_s)}. \tag{2.3}$$


The pdf $f_p(y_s \mid x_s)$ can be general, allowing in particular
for correlated measurements, but modelling the probability
$\Pr(A_s = 1 \mid y_s, x_s)$ is practically only feasible if the
sample can be decomposed into exclusive and exhaustive
subsets $s_j$ such that $\Pr(A_s = 1 \mid y_s, x_s) \propto \prod_j \Pr(A_{s_j} = 1 \mid y_{s_j}, x_{s_j})$
and $\Pr(A_{s_j} = 1 \mid y_{s_j}, x_{s_j})$ satisfies the same
model for all the subsets (see Example 2). In particular, if
the population outcomes are independent given the covariates
under the population model and
$\Pr(A_s = 1 \mid y_s, x_s) \propto \prod_{i \in s} \Pr(I_i = 1 \mid y_i, x_i)$, (2.3) takes the form

$$f_s(y_s \mid x_s) = \prod_{i \in s} \frac{\Pr(I_i = 1 \mid y_i, x_i)\, f_p(y_i \mid x_i)}{\Pr(I_i = 1 \mid x_i)} = \prod_{i \in s} f_s(y_i \mid x_i), \tag{2.4}$$

so that the sample outcomes are likewise independent.


Example 2. Consider the case of a clustered population
$U = \cup_l U_l$, with independent measurements between
clusters, such that $f_p(y_U \mid x_U) = \prod_l f_p(y_{U_l} \mid x_{U_l})$, where
$(y_U, x_U)$ defines all the population values and $(y_{U_l}, x_{U_l})$
the values in cluster $l$. Let $s$ define the set of sampled
clusters, assumed to be drawn independently with probabilities
$\Pr(l \in s \mid y_{U_l}, x_{U_l}) = r(y_{U_l}, x_{U_l})$ for some function
$r(\cdot)$, and suppose also that all the units in the sampled
clusters are observed (single-stage cluster sampling). Then,
$\Pr(A_s = 1 \mid y_U, x_U) = \prod_{l \in s} r(y_{U_l}, x_{U_l}) \times \prod_{j \notin s} [1 - r(y_{U_j}, x_{U_j})]$.
Since for $k \in s$, $(y_{U_k}, x_{U_k}) = (y_k, x_k)$, it follows that
$\Pr(A_s = 1 \mid y_s, x_s) = \prod_{k \in s} r(y_k, x_k) \times G$, where for given
covariates $x_{U_j}$, $j \notin s$, $G$ is a constant satisfying
$G = \prod_{j \notin s} \int [1 - r(y_{U_j}, x_{U_j})]\, f_p(y_{U_j} \mid x_{U_j})\, dy_{U_j}$. The case of a
non-clustered population with independent measurements
and Poisson sampling of individual units is a special case
where each cluster consists of a single element, giving rise
to (2.4).


Remark 2. The examples considered so far assume 
independent sampling, which preserves the independence of 
the outcomes after sampling, but this assumption can 
usually be relaxed following a result proved and illustrated
in Pfeffermann et al. (1998a). By this result, under some 
general regularity conditions and for many commonly used 
sampling schemes for selection with unequal probabilities, 
if the population measurements are independent, the sample 
measurements are asymptotically independent under the 
sample distribution. The asymptotic framework requires that 
the population size increases but the sample size is held 
fixed. As illustrated in Section 2.3, the assumption of
independent population measurements is often also not 
restrictive. 

So far, we suppressed for convenience from the notation
the parameters underlying the population pdf and the
sampling process. Consider, for example, the sample pdf
(2.3). With added parameter notation, it can be written as

$$f_s(y_s \mid x_s; \theta, \gamma) = \frac{\Pr(A_s = 1 \mid y_s, x_s; \gamma)\, f_p(y_s \mid x_s; \theta)}{\Pr(A_s = 1 \mid x_s; \theta, \gamma)}. \tag{2.5}$$

Thus, the conditional population and sample pdfs are
different, unless

$$\Pr(A_s = 1 \mid y_s, x_s; \gamma) = \Pr(A_s = 1 \mid x_s; \theta, \gamma)\ \ \forall y_s. \tag{2.6}$$

When (2.6) holds, inference on the target parameter $\theta$ can
be implemented by fitting the population model to the
sample data, ignoring the sample selection. Note that this
conclusion refers to the selected sample defined by the event
$A_s = 1$.

The condition (2.6) is a strong condition. In a funda- 
mental article on missing values, Rubin (1976) establishes 
conditions under which the sampling process can be ignored 
for likelihood, Bayesian or sampling theory (repeated 
sampling from a model) inference, that is, conditions under 
which the population model defined by $f_p(y_s \mid x_s; \theta)$ can
be fitted to the observed data, depending on the inference
method used. Little (1982) extends Rubin’s results by
distinguishing between the sample selection and the
response process. Another important distinction is that Little
conditions on the population values $Z_U$ of the design
variables used for the sample selection. Inference on the
target population model $f_p(y_s \mid x_s; \theta)$ requires therefore
integrating the conditional pdf of $y_s \mid Z_U, x_s$ over the
distribution of $Z_U \mid x_s$ (see Section 3). Sugden and Smith
(1984) establish conditions under which a sampling process
that depends on design variables $Z$ is ignorable, given
partial information on the design. Let $d_s = D_s(Z_U)$ contain
all the available design information for a sample $s$ such as
strata membership (may only be known for the sampled
units), sample selection probabilities etc. Using previous
notation, a key condition for ignorability of the sampling
process given the available design information is that
$A_s \perp Z_U \mid d_s$, with “$\perp$” meaning independence, implying
$\Pr(A_s = 1 \mid Z_U = z_U) = \Pr(A_s = 1 \mid d_s)$ for any $z_U$ for
which $D_s(z_U) = d_s$.
For large scale multi-stage sample surveys with possibly
many design variables, it is generally difficult and often
impractical to check directly the conditions that permit
ignoring the sample selection or nonresponse given the
available design information. On the other hand, even when
the sample pdf is different from the population pdf, it does
not necessarily imply that inference that ignores the
sampling process is wrong. As a simple illustration, consider
the special case of Example 1 where $\gamma_2 = 0$. In this
case the sample pdf is normal with the same slope coefficients
and residual variance as under the population pdf.
Thus, for inference about the slope coefficients one can
ignore the sampling process. A similar result holds for
logistic models when the sample selection depends on $y$
but not on $x$. See Pfeffermann et al. (1998a) for derivation
of this result. Pfeffermann and Sverchkov (2009) review
several test statistics proposed in the literature for assessing
whether ignoring the sample selection is justified for the
intended inference.


2.2 The use of the randomization distribution for 
inference 

A unique feature of sample surveys is that the sample is
selected at random by use of a sampling design $[\{s, \Pr(s)\},
s \in S]$. The sampling design induces a (discrete) randomization
distribution for any statistic $T$, which is the
conditional distribution over all possible sample selections,
given the finite population values. Thus, the statistic $T$
takes the value $t_s$ with probability $\Pr(s)$, $s \in S$. Classical
survey sampling inference is based solely on this distribution.
For example, the familiar Horvitz-Thompson (HT)
estimator $\hat{T}^{HT}$, which takes the value $t_s^{HT} = \sum_{i \in s}(y_i/\pi_i)$ if
sample $s$ is drawn, is randomization-unbiased for the finite
population total $T_y = \sum_{i=1}^{N} y_i$, since $\sum_{s \in S} \Pr(s)\, t_s^{HT} = T_y$.
Its variance is $\mathrm{Var}(\hat{T}^{HT}) = \sum_{s \in S} \Pr(s)(t_s^{HT} - T_y)^2$. Notice
that in the case of nonresponse, the use of the randomization
distribution requires knowledge of the response
probabilities, which in practice can only be estimated.
The HT estimator takes in this case the form
$\hat{T}^{HT} = \sum_{i \in R} y_i / [\pi_i \times \Pr(R_i = 1 \mid I_i = 1)]$, where $R$ defines the
subsample of respondents. See Fuller (2002) for further
discussion.
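
The randomization distribution of the HT estimator can be enumerated exactly for a tiny population. The sketch below assumes simple random sampling without replacement of $n = 2$ units from a made-up population of $N = 4$ values, so that every sample has the same probability and $\pi_i = n/N$; it is an illustration of the definitions above rather than a general implementation.

```python
from itertools import combinations

y = [3.0, 7.0, 8.0, 12.0]                    # finite population values (made up)
N, n = len(y), 2

samples = list(combinations(range(N), n))    # all possible samples s
pr_s = 1.0 / len(samples)                    # Pr(s) under SRS without replacement
pi = [n / N] * N                             # inclusion probabilities pi_i

# Value of the HT estimator for each possible sample.
t_ht = [sum(y[i] / pi[i] for i in s) for s in samples]

total = sum(y)                                               # population total T_y
expectation = sum(pr_s * t for t in t_ht)                    # randomization expectation
variance = sum(pr_s * (t - total) ** 2 for t in t_ht)        # randomization variance
print(total, expectation, variance)
```

The expectation over the six possible samples equals the population total, which is the randomization-unbiasedness property stated in the text.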

The randomization distribution conditions on the realized 
population values. Consequently, it can be used for descrip- 
tive inference on known functions of the finite population 
values, but not for analytic inference on a hypothesized 
model giving rise to these values. For this, one may consider 
the joint distribution over all possible sample outcomes for 
given population values (the randomization r-distribution) 
and all possible realizations of the finite population mea- 
surements (the model p-distribution). See Binder and 
Roberts (2009) and the references therein. The combined 
$r$-$p$ distribution offers an alternative framework of
inference to the use of the pdfs $f_s(y \mid x)$ or $f_o(y \mid x)$ defined
before.

Example 3: Suppose that the population model is $y_i \sim
\mathrm{Mult}[\{p_k\}, K]$, such that $\Pr(y_i = k) = p_k$, $k = 1, \ldots, K$,
$\sum_{k=1}^{K} p_k = 1$. Let $\Pr(i \in s \mid y_i = k) = \pi_k$. Then, by (2.1),
$\Pr_s(y_i = k) = \Pr(y_i = k \mid i \in s) = \pi_k p_k / \sum_{j=1}^{K} \pi_j p_j = p_k^s$, or
$y_i \mid i \in s \sim \mathrm{Mult}(\{p_k^s\}, K)$. Assuming independence of the
observed outcomes and known selection probabilities, the
maximum likelihood estimator (mle) of $p_k$ based on the
sample distribution is $\hat{p}_k = (n_k/\pi_k) / \sum_{j=1}^{K} (n_j/\pi_j)$, where
$n_k$ is the number of sampled units with outcome $y_i = k$.
The use of the $r$-$p$ distribution suggests estimating $p_k$ by
the HT estimator $\tilde{p}_k = (1/N) \sum_{i \in s,\, y_i = k} (1/\pi_k) = (n_k/\pi_k)/N$.
The estimator $\tilde{p}_k$ is randomization-unbiased for $P_k =
N_k/N$, where $N_k$ is the number of population units with
outcome $y_i = k$, and $P_k$ is $p$-unbiased for $p_k$, such that
$\tilde{p}_k$ is $r$-$p$ unbiased for $p_k$.
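
The two estimators of Example 3 differ only in their denominator, which the following sketch makes explicit. The counts, the selection probabilities and the population size are invented, and the function names are hypothetical.

```python
def mle_sample_distribution(counts, pi):
    """mle of p_k based on the sample distribution: (n_k/pi_k) / sum_j (n_j/pi_j)."""
    weighted = [n_k / pi_k for n_k, pi_k in zip(counts, pi)]
    total = sum(weighted)
    return [v / total for v in weighted]


def ht_estimator(counts, pi, N):
    """HT estimator of p_k: (n_k/pi_k) / N, with N the known population size."""
    return [n_k / pi_k / N for n_k, pi_k in zip(counts, pi)]


counts = [45, 66, 95]          # n_k: sampled units in each category (made up)
pi = [0.01, 0.02, 0.05]        # pi_k: selection probability given y_i = k (made up)
N = 10000                      # population size (made up)

mle = mle_sample_distribution(counts, pi)
ht = ht_estimator(counts, pi, N)
print(mle, sum(mle))           # the mle always sums to one
print(ht, sum(ht))             # the HT estimates need not sum to one
```

The mle normalizes by the estimated population size $\sum_j n_j/\pi_j$, whereas the HT estimator divides by the known $N$.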


The obvious difference between the $r$-$p$ distribution
and the sample distribution, $f_s(y \mid x)$, is that the latter
conditions on the observed sample of units (and hence the
observed values of the covariates or the selected clusters in a
cluster sample), whereas the $r$-$p$ distribution accounts
for all possible sample selections. Consequently, the use of
the $r$-$p$ distribution does not lend itself in general to
conditional inference. The use of the pdfs $f_s(y \mid x)$ or
$f_o(y \mid x)$ requires modelling $\Pr(I_i = 1 \mid x_i, y_i)$ (Equation
2.1) and $\Pr(R_i = 1 \mid y_i, x_i, I_i = 1)$ in case of nonresponse
(Equation 2.2), but it permits the computation (estimation)
of the conditional pdf of the observed outcomes given the
covariates, and hence the use of classical inference tools.


2.3 Data obtained from a cluster sample 


Another special feature of survey data mentioned in the 
introduction is clustering, due to the use of multi-stage 
cluster samples. The clusters are ‘natural groups’ such as 
households, residence blocks, schools, or even individuals 
in the case of longitudinal surveys. Consequently, the out- 
comes pertaining to the same cluster are generally corre- 
lated, known as the intraclass correlation. It is important to 
emphasize that the clusters represent an existing population 
grouping, such that an intraclass correlation exists also 
under the population model. 

Pfeffermann and Smith (1985) review several classes of 
plausible regression models for clustered populations, and 
discuss how they can be estimated from the sample. A popu- 
lation model in common use is the random intercept model, 


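$$y_{ij} = x_{ij}'\beta + u_i + \varepsilon_{ij}, \quad i = 1, \ldots, M, \quad j = 1, \ldots, N_i;$$
$$u_i \sim N(0, \sigma_u^2); \quad \varepsilon_{ij} \sim N(0, \sigma_\varepsilon^2), \tag{2.7}$$

where $M$ defines the number of clusters and $N_i$ the number
of units in cluster $i$. The model assumes also
$E(u_i \varepsilon_{ij}) = 0\ \forall i, j$. Notice that under this model
$\mathrm{Var}(y_{ij}) = \sigma_u^2 + \sigma_\varepsilon^2$,
$\mathrm{Cov}(y_{ij}, y_{il}) = \sigma_u^2$ for $j \neq l$ and $\mathrm{Cov}(y_{ij}, y_{kl}) = 0$ for $i \neq k$,
implying

$$\mathrm{Corr}(y_{ij}, y_{il}) = \sigma_u^2/(\sigma_u^2 + \sigma_\varepsilon^2) \ \text{for } j \neq l; \quad \mathrm{Corr}(y_{ij}, y_{kl}) = 0 \ \text{for } i \neq k. \tag{2.8}$$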
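
The intraclass correlation implied by the random intercept model can be illustrated by simulation. The sketch below assumes no covariates ($x_{ij}'\beta = 0$), two units per cluster and arbitrary values of $\sigma_u$ and $\sigma_\varepsilon$; it only compares the empirical within-cluster correlation with $\sigma_u^2/(\sigma_u^2 + \sigma_\varepsilon^2)$.

```python
import random

random.seed(3)
sigma_u, sigma_e = 1.0, 2.0     # illustrative standard deviations
M = 20000                       # number of simulated clusters, two units each

first, second = [], []
for _ in range(M):
    u = random.gauss(0.0, sigma_u)                   # cluster random effect u_i
    first.append(u + random.gauss(0.0, sigma_e))     # y_i1
    second.append(u + random.gauss(0.0, sigma_e))    # y_i2


def pearson(a, b):
    """Plain Pearson correlation between two equally long lists."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((p - ma) * (q - mb) for p, q in zip(a, b))
    va = sum((p - ma) ** 2 for p in a)
    vb = sum((q - mb) ** 2 for q in b)
    return cov / (va * vb) ** 0.5


rho_hat = pearson(first, second)
rho_theory = sigma_u ** 2 / (sigma_u ** 2 + sigma_e ** 2)
print(rho_hat, rho_theory)
```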


Scott and Holt (1982) show that estimating $\beta$ in (2.7) by
ordinary least squares (OLS) usually results in a small loss 
of efficiency, compared to the use of the optimal generalized 
least squares (GLS) estimator. However, ignoring the intra- 
cluster correlation when estimating the variance of the OLS 
estimator may result in considerable variance underesti- 
mation and hence wrong size and excessive powers of test 
statistics and too short confidence intervals. 

The results in Scott and Holt (1982) and Pfeffermann and 
Smith (1985) assume noninformative sampling and full 
response. When this is not the case, the model holding for 
the sample data is different from the corresponding popula- 
tion model, although the clustered nature of the model is 
preserved as we now show. Consider the following two- 
level population model: 


Level 1: u,| t, ~ @,(u;| t;; 9), 7 = |... Ml 29) 
Revel2. 2 iG Oy a 1, 8: = ene 


where $g_p$ and $f_p$ denote the first and second-level pdfs
with known covariates $t_i$ and $x_{ij}$, governed by the hyper-parameters
$\theta_1$ and $\theta_2$ respectively. The model (2.7) is a
special case of (2.9) in which $g_p$ and $f_p$ are normal pdfs
with $t_i = 0$ (no covariates), $\theta_1 = \sigma_u^2$ and $\theta_2 = (\beta', \sigma_\varepsilon^2)'$.
Suppose that the sample is selected by the following two-stage
sampling process. In the first stage a sample $s_1$ of
$m < M$ first-level units (clusters; say, schools) is selected
with probabilities $\pi_i = \Pr(i \in s_1)$ that may be correlated
with the random effects $u_i$ after conditioning on the
covariates $t_i$. In the second stage a sub-sample $s_i$ of
$n_i < N_i$ second-level units (ultimate sampling units; say,
pupils) is sampled from each selected first-level unit $i$ with
probabilities $\pi_{j|i} = \Pr(j \in s_i \mid i \in s_1)$ that may be correlated
with the outcomes $y_{ij}$ after conditioning on the
covariates $x_{ij}$. Denote by $I_i$ and $I_{j|i}$ the first and second-stage
sampling indicators. By (2.1), the two-level sample
model holding for the observed data, corresponding to the
population model (2.9) is,


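Level 1:
$f_s(u_i \mid t_i; \theta_1, \gamma_1) = \dfrac{\Pr(I_i = 1 \mid u_i, t_i; \gamma_1)\, g_p(u_i \mid t_i; \theta_1)}{\Pr(I_i = 1 \mid t_i; \theta_1, \gamma_1)}$

Level 2:     (2.10)
$f_s(y_{ij} \mid x_{ij}, u_i; \theta_2, \gamma_2) = \dfrac{\Pr(I_{j|i} = 1 \mid y_{ij}, x_{ij}; \gamma_2)\, f_p(y_{ij} \mid x_{ij}, u_i; \theta_2)}{\Pr(I_{j|i} = 1 \mid x_{ij}, u_i; \theta_2, \gamma_2)}$

where I assume $\Pr(I_{j|i} = 1 \mid y_{ij}, u_i, x_{ij}; \gamma_2) = \Pr(I_{j|i} = 1 \mid y_{ij}, x_{ij}; \gamma_2)$.
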
Remark 3. By the independence result in Remark 2, if
$y_{ij} \mid u_i$ are independent under the population model, they
are asymptotically independent under the sample model.
Similarly, if the random effects $u_i$ are independent under
the population model, they are asymptotically independent
under the sample model. Thus, the sample model (2.10) is a
genuine two-level model, although with different distributions
and possibly more parameters. Evidently, the models
(2.9) and (2.10) are different, unless $\Pr(I_{j|i} = 1 \mid y_{ij}, x_{ij}) = \Pr(I_{j|i} = 1 \mid u_i, x_{ij})$
and $\Pr(I_i = 1 \mid u_i, t_i) = \Pr(I_i = 1 \mid t_i)$.

So far I assumed implicitly full response. Suppose, for
example, that in sampled cluster (first level unit) $i$ only a
sub-sample $r_i \subset s_i$ responds, and denote by $R_{j|i}$ the
response indicator. The second-level model for the observed
outcomes is now,


evelule 
Sori Vy | X joUjs OY a) 
= LO% | X Ui i, = 1, R yi = 1) 


me Pr(R,, = i] y Vio Xijs i =I; > VAs, (Vij Riot nei ey) 
Prk Ah = ils 855745575) 


ji 


(2.11) 


The pdf (2.11) coupled with the level 1 pdf in (2.10) defines
the model holding for the observed data in the case of 
informative cluster sampling and NMAR nonresponse. 


3. How can we estimate population models from 
complex survey data? 


In this section I review the main approaches proposed in 
the literature to deal with the special features of complex 
survey data discussed in Section 2, and propose some 
modifications. In order to simplify the discussion, I consider 
the following set up used for the simulation study in 
Section 4. 


3.1 Population model and sampling design 


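Consider a stratified population $U = U_1 \cup \ldots \cup U_H$ of
size $N$. Specifically, define for every unit $i \in U$ a random
vector stratification indicator $z_i = (z_{i1}, \ldots, z_{iH})'$ such that
$\Pr(z_{ih} = 1) = p_h$, $\sum_{h=1}^{H} p_h = 1$ and $i \in U_h$ if $z_{ih} = 1$. The
stratification is carried out independently between the units.
Values of an outcome variable $Y$ are generated as
$y_i = \beta_0 + \beta_1 x_i + \alpha_0 \delta_i + \alpha_1 \delta_i x_i + \varepsilon_i$, $\varepsilon_i \sim N(0, \sigma_\varepsilon^2)$, where the
$x_i$'s are fixed scalar covariates, $(\beta_0, \beta_1, \alpha_0, \alpha_1)$ are fixed
coefficients and

$\delta_i = \dfrac{1}{H} \sum_{h=1}^{H} \dfrac{z_{ih}}{p_h} - 1$.

Notice that $\delta_i$ is a random variable with mean zero and
variance

$V_\delta = \left[ \dfrac{1}{H^2} \sum_{h=1}^{H} \dfrac{1}{p_h} \right] - 1$,

implying that for given covariates $x_i, x_j$,

$E_p(y_i \mid x_i) = \beta_0 + \beta_1 x_i$, $\mathrm{Var}_p(y_i \mid x_i) = (\alpha_0 + \alpha_1 x_i)^2 V_\delta + \sigma_\varepsilon^2$, $\mathrm{Cov}_p(y_i, y_j \mid x_i, x_j) = 0$, $i \neq j$.     (3.1)

However, for unit $i \in U_h$,

$y_i \mid x_i, z_{ih} = 1 \sim N[(\beta_0 + \alpha_0 \delta_h) + (\beta_1 + \alpha_1 \delta_h) x_i,\ \sigma_\varepsilon^2]$.     (3.2)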


Thus, the regression model in each stratum is the classical 
linear model with constant variance, but the intercepts and 
slopes change across the strata. 

The model defined by (3.1) and (3.2) is a realistic 
random coefficients regression model, which I think mimics 
many populations encountered in practice. 

We used systematic probability proportional to size 
(PPS) sampling within the strata for drawing the samples 
with the size variable defined as z; = max {min[(|q,|)'”, 
9}, 1}; q,~ NU + x,, 1). There is nothing novel about the 
choice of this size variable except that it allows for a clear
distinction between the variance of the various estimators.
This size variable does not depend on the outcome $y_i$, and
hence the sampling process within each stratum is noninformative.
However for disproportionate allocation of the
sample between the strata, the sampling scheme is informative
because of the different models operating in different
strata, such that the observed outcomes carry information on
the strata membership and $\Pr(i \in s \mid y_i, x_i) \neq \Pr(i \in s \mid x_i)$.
We focus on the estimation of the regression coefficients
$(\beta_0, \beta_1)$ in (3.1) as the target of inference and assume that
the available sample information consists of the observed
outcomes and covariates, the strata membership vectors $z_i$,
and the strata sizes, $\{N_h\}$.
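
A minimal Python sketch of systematic PPS selection within a single stratum is given below. It is an illustration under stated assumptions rather than the code used for the simulation study: the function name `systematic_pps` is hypothetical, the units are taken in the order supplied, and every unit is assumed to satisfy $n \times \text{size}_i / \sum_k \text{size}_k \leq 1$ so that no unit can be selected twice.

```python
import numpy as np

def systematic_pps(sizes, n, rng):
    """Systematic PPS sample of n units; returns selected indices and
    the first-order inclusion probabilities of all units."""
    sizes = np.asarray(sizes, dtype=float)
    total = sizes.sum()
    step = total / n                          # sampling interval
    start = rng.uniform(0.0, step)            # random start in [0, step)
    points = start + step * np.arange(n)      # equally spaced selection points
    cum = np.cumsum(sizes)                    # cumulated size measures
    # A unit is selected when a selection point falls in its cumulated interval.
    idx = np.searchsorted(cum, points, side="right")
    idx = np.minimum(idx, len(sizes) - 1)     # guard against rounding at the upper end
    pi = n * sizes / total                    # inclusion probabilities
    return idx, pi

rng = np.random.default_rng(0)
selected, pi = systematic_pps([1.0, 4.0, 2.5, 9.0, 3.0, 1.5], 2, rng)
weights = 1.0 / pi[selected]                  # sampling weights of the selected units
```

The sampling weights of the selected units are the inverses of their inclusion probabilities, which is what the weighted estimators discussed later take as input.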


3.2 Including the design variables among the 
covariates 
As implied by (2.3), the population model (pdf),
$f_p(y_i \mid x_i)$ and the sample model $f_s(y_i \mid x_i)$ are the
same if $\Pr(I_i = 1 \mid y_i, x_i) = \Pr(I_i = 1 \mid x_i)\ \forall y_i$. By
(2.2), the response process is ignorable if $\Pr(R_i = 1 \mid y_i, x_i, I_i = 1) = \Pr(R_i = 1 \mid x_i, I_i = 1)\ \forall y_i$. Thus, a possible
way to account for the sampling and response effects is to
add to the model covariates all the variables and interactions
determining the sample and response probabilities and then
integrate them out in order to estimate the model of interest.
Denote these variables by $J = Z \cup L$ with population
values $J_U$, where $L$ defines the variables explaining the
response probabilities. Assuming //, phvel XeyoJr)= Tool ae 

Jy), the use of this approach requires to fit first the ode 


Fy(¥51 Xo Ju = Jy) = [Fe Yel Xu» i AY;, 3.3) 
and then integrate, 
Te X,) = ep velta de Ge ed je: 


Variants of this approach can be found in DeMets and 
Halperin (1977), Holt, Smith and Winter (1980), Nathan 
and Holt (1980), Jewell (1985), Skinner (1994), Chambers 
and Skinner (2003, Chapter 2) and Gelman (2007). 

The use of the approach is appealing, and it has the 
advantage of allowing classical model based inference 
procedures once the variables J,, = Z,, U L,, are included 
in the model, but it is often limited in practice for the 
following reasons: 

1. It requires knowledge of the population values of all 
the variables determining the sample selection and 
response, and this information is usually unknown to 
the analyst fitting the model because of confi- 
dentiality restrictions or other reasons. Even if 
known, including in the model all the geographic 
and operational variables used for the sampling 
design and the variables explaining the response 
may be formidable. 

2. In practice there may be many covariates and many 
design variables, and modelling the relationship 
between the design variables and the covariates in 
order to integrate out the effect of the former 
variables can be complicated and may no longer 
reproduce the original target model. 


(3.4) 


Feder (2011) proposes the following simple solution to
this problem. Suppose first that the design variables and the
covariates are known for every element in the population.
The proposed solution consists of imputing the missing
population outcomes using the model for the outcome given
the covariates and the design variables, fitted to the sample
data, and then fitting the population
model $f_p(y_i \mid x_i)$ using all the population values, with the
missing outcomes replaced by their imputed values. When
the design variables and the covariates are unknown for the
non-sampled units, they need to be imputed as well. The
imputation may be carried out by sampling with replacement
$(N - n)$ values $(x_i, z_i)$ from the sample values with
probabilities $p_i = (w_i - 1) / \sum_{k=1}^{n} (w_k - 1)$ on each draw,
where the $w_i$'s are the sampling weights. See Pfeffermann
and Sikov (2011) for justification of this procedure under
the sample model and an extension for the case of NMAR
nonresponse.


3. The approach is not operational when the inclusion
in the sample depends also on the outcome values,
that is, when $\Pr(I_i = 1 \mid y_i, x_i, z_i) \neq \Pr(I_i = 1 \mid x_i, z_i)$.
A classical example is
case-control studies (Scott and Wild 2009), but a
similar problem arises when the nonresponse is
NMAR.
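
The draw of covariates and design variables for the non-sampled units in the solution of Feder (2011), described above, can be sketched in Python as follows. This is only an illustration of the resampling step with probabilities $p_i = (w_i - 1) / \sum_k (w_k - 1)$; the function name `draw_nonsampled` is hypothetical, and the sketch assumes that all weights satisfy $w_i \geq 1$ with at least one $w_i > 1$.

```python
import numpy as np

def draw_nonsampled(x, z, w, N, rng):
    """Draw (N - n) pairs (x_i, z_i) with replacement from the sample,
    with selection probabilities proportional to (w_i - 1)."""
    x = np.asarray(x)
    z = np.asarray(z)
    w = np.asarray(w, dtype=float)
    n = len(w)
    p = (w - 1.0) / np.sum(w - 1.0)           # draw probabilities p_i
    idx = rng.choice(n, size=N - n, replace=True, p=p)
    return x[idx], z[idx], p

rng = np.random.default_rng(1)
x_imp, z_imp, p = draw_nonsampled([10, 20, 30], [0, 1, 1], [1.0, 2.0, 5.0], N=10, rng=rng)
```

A sampled unit with weight 1 represents only itself, so it gets probability zero of being drawn for the non-sampled part of the population.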


Remark 4. Including the design variables and the variables 
explaining the response in the model does not necessarily 
require integrating them out even if they are not part of the 
covariates of interest, as the following example shows. 


Example 4: Suppose that a sample of size $n$ is selected with
probabilities defined by the population values of design
variables $Z$ and that all the sampled units respond. Let the
population distribution of $Y, X, Z$ be multivariate normal.
The data available to the analyst consist of the sample
values $[y_s, x_s]$ and the population values $Z_U$. Using properties
of the multivariate normal distribution, $E_p(y \mid x) = \beta_0 + \beta_{yx} x$
for some coefficients $(\beta_0, \beta_{yx})$, but the OLS
estimate of $\beta_{yx}$ is biased because the sampling probabilities
depend on $Z$, which is correlated with $Y$. The mle of $\beta_{yx}$
for the case of a trivariate normal distribution is (DeMets
and Halperin 1977),

$\hat{\beta}_{yx} = \left[ s_{yx} + \dfrac{s_{yz} s_{xz}}{s_{zz}} \left( \dfrac{\sigma_z^2}{s_{zz}} - 1 \right) \right] \Big/ \left[ s_{xx} + \dfrac{s_{xz}^2}{s_{zz}} \left( \dfrac{\sigma_z^2}{s_{zz}} - 1 \right) \right]$,     (3.5)

where $s_{uv} = n^{-1} \sum_{i=1}^{n} (u_i - \bar{u}_s)(v_i - \bar{v}_s)$ and $\sigma_z^2 = N^{-1} \sum_{i=1}^{N} (z_i - \bar{Z}_U)^2$,
with $\bar{u}_s$, $\bar{v}_s$ and $\bar{Z}_U$ defining the corresponding
sample and population means. Thus, the population values
of $Z$ feature in this case in the optimal estimator of the
target parameter $\beta_{yx}$. Holt et al. (1980) extend this result to
the case where $Y, X, Z$ are vector variables. Nathan and
Holt (1980) establish conditions under which $\hat{\beta}_{yx}$ is consistent
without the multivariate normality assumptions.
Pfeffermann and Holmes (1985) study the robustness of the
estimator to model misspecification.


3.3 Using the sampling weights as surrogate for the
design variables 


For situations where there are too many design variables 
determining the sample selection to include them all in the 
model, or when some or all of these variables are unknown 
to the analyst, it is often advocated to include in the model 
the sampling weights as surrogate of the design variables. 
Examples of the use of this approach can be found in 
DuMouchel and Duncan (1983), Sarndal and Wright 




(1984), Rubin (1985), Chambers, Dorfman and Wang 
(1998) and Wu and Fuller (2006). 

Rubin (1985) defines the vector $a = (a_1, \ldots, a_N)' = a(Z_U)$
to be an adequate summary of $Z_U$ if $\Pr(I_i = 1 \mid Z_U) = \Pr(I_i = 1 \mid a)$.
The author shows that the vector
$\pi_U = (\pi_1, \ldots, \pi_N)'$ of the sample inclusion probabilities is
the coarsest possible adequate summary of $Z_U$, though it
may be too coarse. It follows therefore that for sampling
designs such that $\Pr(I_i = 1 \mid y_U, Z_U) = \Pr(I_i = 1 \mid Z_U)$, if
$\pi_U$ is an adequate summary, the sample selection can be
ignored for inference on the parameters of $f_p(y_i \mid x_i, \pi_U)$.
In order to estimate the target model $f_p(y \mid x)$ in this case,
one can follow the same steps as in Section 3.2 with $\pi_U$
taking the role of $Z_U$.

The use of this approach reduces the dimension of the 
added covariates but it requires knowledge of the sample 
inclusion probabilities (or the sampling weights) for all the 
population units, which may not be available in the case of a 
secondary analysis. The case of nonresponse is particularly 
problematic since the response probabilities are generally 
unknown and need to be estimated. Another major problem 
with this approach is that for general sampling designs, the 
vector $\pi_U$ may not be an adequate summary of $Z_U$. Sugden
and Smith (1984) and Smith (1988) establish necessary 
design information required for sampling ignorability. 


Remark 5. Even though the vector $\pi_U$ is not always an
adequate summary of $Z_U$, for sampling designs such that
$\Pr(I_i = 1 \mid y_i, x_i, \pi_i) = \pi_i$, $f_s(y_i \mid x_i, \pi_i) = f_p(y_i \mid x_i, \pi_i)$,
so that the marginal population and sample pdfs for a given
sampled unit are nonetheless the same when adding $\pi_i$ to
the covariates (see Skinner 1994).


Remark 6. In the empirical set up described in Section 3.1 
there is a one to one correspondence between the design 
variables (z',, z;) and the sampling weights (w,, w,). 


3.4 Methods based on probability weighting 


So far we considered methods requiring knowledge of 
the variables J determining the sample selection and 
response probabilities, or at least an adequate summary of 
them. The methods considered below only require knowl- 
edge of the sampling weights for the responding sampled 
units. As such, they are restricted to situations of full 
response, or to cases where the response probabilities can be 
estimated sufficiently accurately, in which case the sam- 
pling weight for a responding unit is the inverse of the 
product of the unit’s selection probability and its estimated 
response probability. Probability weighting (PW) is dis- 
cussed in numerous articles; see the recent discussion in 
Pfeffermann and Sverchkov (2009) and the references 
therein. As before, we focus here on estimation of population
models.

To introduce the idea, consider the case of a census with
full response. Assuming independent outcomes, the model
parameters, $\theta$, are typically estimated in this case by
solving census estimating equations of the form,

$\sum_{i=1}^{N} u(y_i, x_i; \theta) = 0$.     (3.6)

In the case of mle, $u(y_i, x_i; \theta) = (\partial / \partial\theta) \log f_p(y_i \mid x_i; \theta)$,
the $i$th score. In practice, data are available for only a
sample $s \subset U$ and the equations (3.6) are replaced by their
randomization unbiased Horvitz-Thompson estimator,

$\sum_{i \in s} w_i\, u(y_i, x_i; \theta) = 0$,     (3.7)

where the $w_i$'s are the sampling weights.


Remark 7. When the census estimating equations (3.6) are 
the likelihood equations, the estimators obtained by solving 
(3.7) are known in the sampling literature as ‘pseudo mle’ 
(pmle). See Binder (1983), Skinner et al. (1989),
Pfeffermann (1993, 1996) and Godambe and Thompson 
(2009) for discussion with many examples. This approach 
is implemented in many software packages such as SAS, 
STATA, SUDAAN, etc. 


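Example 5. In the case of the standard linear regression
model, the pmle or PW estimator of the vector coefficient $\beta$
solves the equations $\sum_{i \in s} w_i (y_i - x_i' \hat{\beta}_{pw}) x_i = 0$;

$\hat{\beta}_{pw} = \left[ \sum_{i \in s} w_i x_i x_i' \right]^{-1} \sum_{i \in s} w_i x_i y_i$.     (3.8)

The PW estimator of the residual variance is $\hat{\sigma}^2_{pw} = \sum_{i \in s} w_i (y_i - x_i' \hat{\beta}_{pw})^2 / (\sum_{i \in s} w_i - k)$, where $k = \dim(\beta)$.

For logistic regression, the pseudo likelihood equations
(with no explicit solution) are,

$\sum_{i \in s} w_i [y_i - p_i(x_i)] x_i = 0$; $p_i(x_i) = \Pr(y_i = 1 \mid x_i) = \exp(x_i' \beta) / [1 + \exp(x_i' \beta)]$.     (3.9)


Example 6. Let $u(y_i; \theta) = [\Delta(\theta - y_i) - F_p(\theta)]$ where
$F_p(\theta)$ is the cumulative population distribution at $\theta$ and
$\Delta(a) = 1\ (0)$ when $a \geq 0\ (a < 0)$. The PW estimator of
$F_p(\theta)$ is $\hat{F}_{pw}(\theta) = \sum_{i \in s} w_i \Delta(\theta - y_i) / \sum_{i \in s} w_i$, the familiar
Hájek (1971) estimator.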
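
The closed-form estimators of Examples 5 and 6 translate directly into code. The Python sketch below is illustrative only: the function names are hypothetical, the weighted least-squares solution and the residual-variance formula follow Example 5, and the distribution-function estimator of Example 6 is coded taking $\Delta(a) = 1$ for $a \geq 0$.

```python
import numpy as np

def pw_linear_regression(X, y, w):
    """Probability-weighted estimators of beta and of the residual variance."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    w = np.asarray(w, dtype=float)
    XtWX = X.T @ (w[:, None] * X)             # sum_i w_i x_i x_i'
    XtWy = X.T @ (w * y)                      # sum_i w_i x_i y_i
    beta = np.linalg.solve(XtWX, XtWy)
    resid = y - X @ beta
    k = X.shape[1]                            # k = dim(beta)
    sigma2 = np.sum(w * resid**2) / (np.sum(w) - k)
    return beta, sigma2

def hajek_cdf(y, w, t):
    """Weighted (Hajek-type) estimator of the population cdf at t."""
    y = np.asarray(y, dtype=float)
    w = np.asarray(w, dtype=float)
    return np.sum(w * (y <= t)) / np.sum(w)

X = np.column_stack([np.ones(5), np.arange(5.0)])
y = np.array([1.2, 2.9, 5.1, 6.8, 9.1])
w = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
beta_pw, sigma2_pw = pw_linear_regression(X, y, w)
F_hat = hajek_cdf(y, w, 5.1)
```

Solving the linear system with `np.linalg.solve` avoids forming the explicit matrix inverse, which is numerically preferable but otherwise equivalent to (3.8).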


The notable property of PW estimators is that they are
generally $r-p$ consistent. (See Section 2.2 for definition
of the $r-p$ distribution). This can be seen by decomposing
$(\hat{\theta}_{pw} - \theta) = (\hat{\theta}_{pw} - \theta_{cen}) + (\theta_{cen} - \theta)$, where $\theta_{cen}$ is
the (hypothetical) solution of the census equations (3.6).
Under general conditions, $(\hat{\theta}_{pw} - \theta_{cen}) = O_p(n^{-0.5})$ and
$(\theta_{cen} - \theta) = O_p(N^{-0.5})$, thus establishing the $r-p$ consistency
of $\hat{\theta}_{pw}$ under these conditions. The $r-p$ variance
of $\hat{\theta}_{pw}$ can be decomposed as,

$\mathrm{Var}_{rp}(\hat{\theta}_{pw}) = E_p[\mathrm{Var}_r(\hat{\theta}_{pw})] + \mathrm{Var}_p[E_r(\hat{\theta}_{pw})]$.     (3.10)

For single stage sampling, if $n$ is much smaller than $N$ as
is usually the case, the second term on the right hand side of
(3.10) is negligible compared to the first term, and
$\mathrm{Var}_{rp}(\hat{\theta}_{pw})$ can be estimated by the randomization variance
estimator $\widehat{\mathrm{Var}}_r(\hat{\theta}_{pw})$. This result does not necessarily
hold for cluster sampling since in this case $\mathrm{Var}_r(\hat{\theta}_{pw})$ is
typically of order $O(1/m)$ where $m$ is the number of sampled
clusters, and under a suitable model $\mathrm{Var}_p[E_r(\hat{\theta}_{pw})]$ is
$O(1/M)$ where $M$ is the number of population clusters.
For $\widehat{\mathrm{Var}}_r(\hat{\theta}_{pw})$ to be an adequate estimator of $\mathrm{Var}_{rp}(\hat{\theta}_{pw})$
in this case, $m$ must be much smaller than $M$.


Remark 8. The consistency of PW estimators under correct
population model specification may also be established
under the sample distribution (Equation 2.1). Consider the
estimator $\hat{\beta}_{pw}$ in (3.8) and write $\hat{\beta}_{pw} = \beta + [\sum_{i \in s} w_i x_i x_i']^{-1} \sum_{i \in s} x_i w_i \varepsilon_i$,
where the $\varepsilon_i$'s are the population model residuals.
The key result leading to the consistency of $\hat{\beta}_{pw}$ under
the sample distribution is that if $\Pr(I_i = 1 \mid y_i, x_i, \pi_i) = \pi_i$
then $E_s(w_i \varepsilon_i) = E_s(w_i) E_p(\varepsilon_i) = 0$ (follows from (3.14)
below). In fact, by viewing the covariates as random with
$(y_i, x_i)$ having some joint distribution,

$\beta = \arg\min_{\tilde{\beta}} E_p (y_i - x_i' \tilde{\beta})^2 = \arg\min_{\tilde{\beta}} E_s [w_i (y_i - x_i' \tilde{\beta})^2]$,

implying that $\hat{\beta}_{pw}$ is the optimal estimator (in weighted
least-squares metric) of $\beta$ under the sample distribution of
$(y_i, x_i)$. See also (3.24) below. Godambe and Thompson
(1986, 2009) establish and discuss other optimality properties
of estimators solving estimating equations of the form
$\sum_{i \in s} w_i u(y_i, x_i; \theta) = 0$. The following example shows how
probability weighting can be used when modelling clustered
populations.


Example 7. Consider the population two-level (random
intercept) model,

Level 1: $u_i \sim N(t_i' \gamma, \sigma_u^2)$, $i = 1, \ldots, M$
Level 2: $y_{ij} = u_i + x_{ij}' \beta + \varepsilon_{ij}$, $\varepsilon_{ij} \sim N(0, \sigma_\varepsilon^2)$, $j = 1, \ldots, N_i$,     (3.11)

where $\varepsilon_{ij}$ and $u_i$ are independent for all $i$ and $j$. The
unknown parameters are the vectors of coefficients
$\theta = (\beta', \gamma')'$ and the variances $\tau = (\sigma_u^2, \sigma_\varepsilon^2)'$. Assume full
response. Under ignorable sampling of first and second-level
units, the mle of $(\theta, \tau)$ is computed conveniently by
iterating between the estimation of $\theta$ for 'known' $\tau$ and
the estimation of $\tau$ for 'known' $\theta$, with the 'known' values
defined by the estimators from the previous iteration. The
two sets of estimators on the $r$th iteration are the solutions of
linear equations of the form, $P^{(r)} \theta = q^{(r)}$, $R^{(r)} \tau = s^{(r)}$,
with appropriate definition of the matrices $(P^{(r)}, R^{(r)})$ and
the vectors $(q^{(r)}, s^{(r)})$, $r = 1, 2, \ldots$ (Goldstein 1986). When
applied to all the population values, these equations define
the census estimating equations.

Suppose, as before, that a sample $s_1$ of first-level units is
sampled with probabilities $\pi_i = \Pr(i \in s_1)$, and that sub-samples
$s_i$ of size $n_i < N_i$ are sampled from each selected
first-level unit $i$ with probabilities $\pi_{j|i} = \Pr(j \in s_i \mid i \in s_1)$.
The pmle for this model can be obtained by first expressing
the elements of the matrices $(P^{(r)}, R^{(r)})$ and the vectors
$(q^{(r)}, s^{(r)})$ as sums over first and second-level units, and
then estimating each population sum of the form $\sum_{i=1}^{M} a_i$ by
the H-T estimator $\sum_{i \in s_1} (a_i / \pi_i)$, and each population sum of
the form $\sum_{j=1}^{N_i} d_{ij}$ by the H-T estimator $\sum_{j \in s_i} (d_{ij} / \pi_{j|i})$. See
Pfeffermann, Skinner, Holmes, Goldstein and Rasbash
(1998b). Pfeffermann and Sverchkov (2009) review other
methods of probability weighting in two-level models.
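
The two Horvitz-Thompson steps in Example 7 can be combined to estimate a double population sum $\sum_i \sum_j d_{ij}$ from a two-stage sample. The Python sketch below shows only this building block, not the full iterative pmle algorithm; the function name and the list-of-arrays data layout are assumptions made for the illustration.

```python
import numpy as np

def ht_two_level_total(d, pi_cluster, pi_within):
    """Estimate sum_i sum_j d_ij from a two-stage sample.

    d[i]          : values d_ij for the sampled units of sampled cluster i
    pi_cluster[i] : first-stage inclusion probability pi_i
    pi_within[i]  : conditional inclusion probabilities pi_{j|i}
    """
    total = 0.0
    for d_i, pi_i, pi_ji in zip(d, pi_cluster, pi_within):
        # H-T estimate of the within-cluster sum sum_j d_ij
        inner = np.sum(np.asarray(d_i, dtype=float) / np.asarray(pi_ji, dtype=float))
        # H-T expansion of the estimated cluster sum to the population
        total += inner / pi_i
    return total

estimate = ht_two_level_total(
    d=[[1.0, 2.0], [3.0]],
    pi_cluster=[0.5, 0.25],
    pi_within=[[0.5, 0.5], [1.0]],
)
```

Each element of the matrices and vectors in the iterative equations would be replaced by an estimate of this kind before the equations are solved.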

Probability weighting is in broad use both for estimation 
of finite-population quantities, referred to in the literature as 
descriptive inference, and for ‘analytic inference’ on popu- 
lation models. The main attraction of this method is its 
simplicity. It is generally viewed as being ‘model free’, 
except when having to estimate the response probabilities, 
which is often based on models, and hence more robust than 
other methods, but when used for analytical inference, this 
view is questionable. 

Probability-weighted estimators are randomization 
consistent for the corresponding descriptive population 
quantities (CDPQ), defined as the (hypothetical) solutions 
of the census estimating equations. However, if the popu- 
lation model is misspecified, the target CDPQ are not 
(model) p-consistent for the true model parameters and the 
PW estimators are not r — p consistent either. So, proba- 
bility weighting provides no protection against model mis- 
specification, although the estimated CDPQ may be useful 
for various kinds of inference. See Pfeffermann (1993) and 
Binder and Roberts (2009) for discussion and examples. 

Estimating the randomization variance of probability- 
weighted estimators is generally simple, utilizing available 
techniques in finite population sampling. Binder (1983) 
developed a general approach for estimating the ran- 
domization variance of estimators obtained as the solution 
of probability-weighted estimating equations; see also 
Binder and Roberts (2009) and Godambe and Thompson 
(2009). Fuller (1975), Binder (1983), Chambless and Boyle 
(1985) and Francisco and Fuller (1991) developed central 
limit theorems applicable to probability-weighted esti- 
mators. 

In spite of these desirable properties of probability- 
weighting, the method has some severe limitations: 


1. It is restricted mostly to point estimation. Probabilistic
inference like confidence intervals or
hypothesis testing generally requires large sample
normality assumptions. In particular, the ran- 
domization distribution does not lend itself to the 
use of classical inference methods such as like- 
lihood-based or Bayesian inference. 

2. The variances of probability-weighted estimators 
are computed with respect to the randomization 
distribution and the use of this approach does not 
permit conditioning on the selected sample, for 
example, conditioning on the observed covariates 
or the selected clusters in a multi-level model. 

3. As often illustrated in the literature, probability- 
weighted estimators generally have larger vari- 
ances than model-based estimators, notably for 
small samples and large variation of the sampling 
weights. 

4. The use of the randomization distribution does not 
lend itself to prediction problems such as the 
prediction of the outcome for non-sampled units 
with known covariates under a regression model, 
or the prediction of small area means for areas 
with no samples in a small-area estimation 
problem. 


3.5 Modifications of the sampling weights 


When estimating finite population quantities, the sam- 
pling weights are often modified by imposing calibration 
equations, which match the PW estimators of covariates for 
which the population totals are known with the actual totals. 
The use of calibration is particularly useful in the case of 
nonresponse; see Kott (2009) for recent discussion with 
references. We later discuss the use of empirical likelihood 
for analytical inference on population models, which also 
attempts to incorporate calibration equations, although in a 
different manner. Below, I review two modifications of the 
sampling weights aimed at reducing the variances of the 
weighted estimators of model parameters under the sample 
distribution (2.1). A combination of the two modifications is 
also considered. 

Magee (1998) considers a linear regression model but the
results can be extended to other population models. The
author shows that under certain moment assumptions, any
estimator $\hat{\beta}_{mg}(\alpha) = [\sum_{i \in s} w_i a_i(\alpha) x_i x_i']^{-1} \sum_{i \in s} w_i a_i(\alpha) x_i y_i$
with positive weights $a_i(\alpha) = a(x_i, \alpha)$ is consistent for $\beta$
under the sample distribution. The weights $a(x_i, \alpha)$ belong
to a parameterized family of functions with the vector
parameter $\alpha$ chosen to minimize a scalar variance criterion
such as the determinant or the trace of the asymptotic
variance estimator,

$\widehat{\mathrm{Avar}}[\hat{\beta}_{mg}(\alpha)] = \left[ \sum_{i \in s} w_i a_i(\alpha) x_i x_i' \right]^{-1} \left[ \sum_{i \in s} w_i^2 a_i^2(\alpha) \hat{\varepsilon}_i^2 x_i x_i' \right] \left[ \sum_{i \in s} w_i a_i(\alpha) x_i x_i' \right]^{-1}$,     (3.12)

where $\hat{\varepsilon}_i = (y_i - x_i' \hat{\beta}_{pw})$. The choice of the function
$a(x_i, \alpha)$ is up to the analyst but the obvious idea is to
choose a function that is believed to be approximately
inversely proportional to the residual variance under the
sample model. The resulting 'Quasi-Aitken' estimator is
shown to have asymptotically a lower variance under the
sample distribution than the probability-weighted estimator
$\hat{\beta}_{pw}$. Recall from Remark 8 that $\hat{\beta}_{pw}$ is consistent for $\beta$
under the sample distribution, justifying comparing the
asymptotic variances of the two estimators under this distribution.

Pfeffermann and Sverchkov (1999) propose another
modification. Consider the population model,

$y_i = m(x_i; \theta) + \varepsilon_i$, $E_p(\varepsilon_i \mid x_i) = 0$, $E_p(\varepsilon_i^2 \mid x_i) = \sigma_\varepsilon^2$,     (3.13)


where $m(x_i; \theta)$ has a known form. Let $q_i = w_i / E_s(w_i \mid x_i)$.
The authors show that if $\Pr(I_i = 1 \mid \pi_i, y_i, x_i) = \pi_i$,

$E_p(y_i \mid x_i) = E_s(w_i y_i \mid x_i) / E_s(w_i \mid x_i)$.     (3.14)

Thus, for vectors $\tilde{\theta}$ in the plausible parameter space $\Theta$,

$\theta = \arg\min_{\tilde{\theta}} \dfrac{1}{n} \sum_{i=1}^{n} E_p \{ [y_i - m(x_i; \tilde{\theta})]^2 \mid x_i \} = \arg\min_{\tilde{\theta}} \dfrac{1}{n} \sum_{i=1}^{n} E_s \{ q_i [y_i - m(x_i; \tilde{\theta})]^2 \mid x_i \}$.

The vector $\theta$ can be estimated therefore by solving the
minimization problem,

$\hat{\theta}_q = \arg\min_{\tilde{\theta}} \dfrac{1}{n} \sum_{i=1}^{n} q_i [y_i - m(x_i; \tilde{\theta})]^2$; $q_i = w_i / E_s(w_i \mid x_i)$.     (3.15)

The use of this estimator requires estimating $E_s(w_i \mid x_i)$ but
under mild regularity conditions $\hat{\theta}_q$ is consistent for $\theta$ even
when the expectation $E_s(w_i \mid x_i)$ is misspecified. See
Pfeffermann and Sverchkov (2009) and Section 4.1 of this
paper for examples of the specification and estimation of
$E_s(w_i \mid x_i)$.


Example 8. Under the linear regression population model
with constant variance,

$\hat{\beta}_q = \left[ \sum_{i \in s} q_i x_i x_i' \right]^{-1} \sum_{i \in s} q_i x_i y_i$.     (3.16)

As easily verified, $\hat{\beta}_q$ is randomization consistent for the
census regression coefficients $B = [\sum_{i=1}^{N} x_i x_i' / E_s(w_i \mid x_i)]^{-1} \sum_{i=1}^{N} x_i y_i / E_s(w_i \mid x_i)$,
and hence $r-p$ consistent for $\beta$,
even when $E_s(w_i \mid x_i)$ is misspecified.
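
A minimal Python sketch of the q-weighted estimator (3.16) follows. The expectation $E_s(w_i \mid x_i)$ has to be modelled by the analyst; here it is approximated by an ordinary linear regression of the weights on the covariates, which is an assumption made purely for the illustration, as are the function name and the requirement that the fitted values be positive.

```python
import numpy as np

def q_weighted_regression(X, y, w):
    """q-weighted least squares with q_i = w_i / E_s(w_i | x_i)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    w = np.asarray(w, dtype=float)
    # Working model for E_s(w | x): linear regression of w on X (illustrative assumption).
    coef = np.linalg.lstsq(X, w, rcond=None)[0]
    w_hat = X @ coef
    if np.any(w_hat <= 0):
        raise ValueError("fitted E_s(w | x) must be positive")
    q = w / w_hat                              # adjusted weights
    beta = np.linalg.solve(X.T @ (q[:, None] * X), X.T @ (q * y))
    return beta, q

X = np.column_stack([np.ones(4), np.arange(4.0)])
y = np.array([1.1, 2.9, 4.9, 7.1])
w = np.array([2.0, 3.5, 3.0, 6.0])
beta_q, q = q_weighted_regression(X, y, w)
```

When the weights are an exact function of the covariates under the working model, all $q_i$ equal one and the estimator reduces to OLS, in line with the remark that sampling that depends only on the covariates is ignorable.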

The obvious difference between the PW estimator $\hat{\theta}_{pw}$
and the estimator $\hat{\theta}_q$ is that the latter estimator uses the
adjusted weights $q_i = w_i / E_s(w_i \mid x_i)$. When the sample
selection depends only on the covariates, the sampling
process is ignorable. Hence, to protect against informative
sampling, it is only necessary to account for the net
sampling effects on the target conditional pdf of $y_i \mid x_i$.
This is achieved by using the weights $q_i$. In contrast, the
sampling weights $w_i$ account for the sampling effects on
the joint distribution of $(y_i, x_i)$. As a result, they tend to be
more variable and the estimator $\hat{\theta}_{pw}$ has a larger variance.

A combination of the last two modifications is also
possible and examined in Section 4. The simple idea
proposed by Dr. Moshe Feder (private communication) is to
apply the modification of Magee (1998) to the estimator $\hat{\beta}_q$
instead of the estimator $\hat{\beta}_{pw}$; that is, use the estimator,

$\hat{\beta}_{mg-q}(\alpha) = \left[ \sum_{i \in s} q_i a_i(\alpha) x_i x_i' \right]^{-1} \sum_{i \in s} q_i a_i(\alpha) x_i y_i$,     (3.17)

where the vector parameter $\alpha$ is now chosen to minimize a
scalar variance criterion of the asymptotic variance estimator,
$\widehat{\mathrm{Avar}}[\hat{\beta}_{mg-q}(\alpha)]$, computed similarly to (3.12).
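
The weight modification of Magee (1998), and its combination with the q-weights, can be sketched in Python as a search over a grid of values of $\alpha$ using the trace of the sandwich variance estimator as the criterion. Everything specific in the sketch is an assumption made for illustration: the function names, the grid search in place of a formal optimizer, and the family $a(x, \alpha)$ supplied by the caller. The `base` argument holds either the sampling weights $w_i$ or the adjusted weights $q_i$.

```python
import numpy as np

def weighted_ls(X, y, v):
    """Weighted least squares with weights v."""
    return np.linalg.solve(X.T @ (v[:, None] * X), X.T @ (v * y))

def sandwich_avar(X, resid, v):
    """Sandwich variance estimator of the form (3.12) with weights v."""
    bread = np.linalg.inv(X.T @ (v[:, None] * X))
    meat = X.T @ ((v**2 * resid**2)[:, None] * X)
    return bread @ meat @ bread

def magee_type_estimator(X, y, base, alphas, a_fun):
    """Choose alpha on a grid by minimizing the trace of the variance estimator."""
    resid = y - X @ weighted_ls(X, y, base)    # residuals from the base-weighted fit
    best = None
    for alpha in alphas:
        v = base * a_fun(X, alpha)             # modified weights
        crit = np.trace(sandwich_avar(X, resid, v))
        if best is None or crit < best[0]:
            best = (crit, alpha, weighted_ls(X, y, v))
    return best[1], best[2], best[0]

rng = np.random.default_rng(0)
x = rng.uniform(0, 3, size=50)
X = np.column_stack([np.ones(50), x])
y = 1 + 2 * x + rng.normal(scale=1 + x, size=50)
w = rng.uniform(1, 5, size=50)
a_fun = lambda X, a: (1.0 + X[:, 1] ** 2) ** (-a)   # hypothetical family a(x, alpha)
alpha_hat, beta_hat, crit = magee_type_estimator(X, y, w, [0.0, 0.5, 1.0], a_fun)
```

Including $\alpha = 0$ in the grid, for which $a(x, \alpha) = 1$ in this family, guarantees that the selected criterion value is never larger than that of the unmodified weighted estimator.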


3.6 Likelihood based methods 


3.6.1 Use of the sample model for maximum 
likelihood estimation 


A natural way of estimating the population model 
parameters is by maximization of the sample likelihood. 
Assume first full response and that the sample observations 


are independent under the sample distribution. The like- 
lihood has then the form, 


$L_s(\theta, \gamma; y_s, x_s) = \prod_{i=1}^{n} \dfrac{\Pr(I_i = 1 \mid x_i, y_i; \gamma)\, f_p(y_i \mid x_i; \theta)}{\Pr(I_i = 1 \mid x_i; \theta, \gamma)}$.     (3.18)

As before, we assume $\Pr(I_i = 1 \mid \pi_i, y_i, x_i) = \pi_i$, implying
$\Pr(I_i = 1 \mid x_i, y_i) = E_p(\pi_i \mid x_i, y_i)$. By (3.14), the sample
likelihood can be written therefore as,

$L_s(\theta, \gamma; y_s, x_s) = \prod_{i=1}^{n} \dfrac{E_s(w_i \mid x_i; \theta, \gamma)\, f_p(y_i \mid x_i; \theta)}{E_s(w_i \mid y_i, x_i; \gamma)}$.     (3.19)

The expectations on the right hand side of (3.19) are with
respect to the sample pdf of the sampling weights. Thus,
when the weights are known for the sampled units as is
usually the case under full response, the expectations can be
modelled and estimated by regressing $w_i$ against $(y_i, x_i)$,
using classical model fitting procedures. Suppose first that
the weights are continuous such as in probability proportional
to size (PPS) sampling with a continuous size variable.
For a given form of the population model, the expectations
$E_s(w_i \mid y_i, x_i; \gamma)$ and $E_s(w_i \mid x_i; \gamma, \theta)$ can be obtained
then in two steps:

1. Identify and estimate $E_s(w_i \mid y_i, x_i) = E_s(w_i \mid y_i, x_i; \gamma)$, using the sample data.
2. Integrate $\int [1 / E_s(w_i \mid y, x_i; \gamma)]\, f_p(y \mid x_i; \theta)\, dy$ to obtain
$E_p(\pi_i \mid x_i; \theta, \gamma)$. Compute $E_s(w_i \mid x_i; \theta, \gamma) = 1 / E_p(\pi_i \mid x_i; \theta, \gamma)$ (follows from (3.14)).

Estimating the vector parameter $\gamma$ outside the likelihood
and then substituting the estimate in (3.19) and maximizing
the likelihood as a function of the vector parameter $\theta$ only,
usually yields more stable results than maximizing the
likelihood over $(\theta, \gamma)$ simultaneously.

Estimation of the expectations $E_s(w_i \mid y_i, x_i)$ and
$E_s(w_i \mid x_i; \theta, \gamma)$ in the case of discrete inclusion probabilities
is similar.


Example 9. Consider the case of multinomial-logistic 
regression with a discrete covariate x and M possible 
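values of the outcome $y$. Assuming that $E_s(w_i \mid y_i = m, x_i = k)$
is not a function of the model parameters, it can
be estimated by $\bar{w}_{mk}$, the mean of the weights in cell
$(m, k)$, and hence $\hat{\pi}_{mk} = \widehat{\Pr}(i \in s \mid y_i = m, x_i = k) = (1 / \bar{w}_{mk})$. We obtain:

$\widehat{\Pr}_s(y_i = m \mid x_i = k; \theta) = \dfrac{\Pr_p(y_i = m \mid x_i = k; \theta) / \bar{w}_{mk}}{\sum_{m'=1}^{M} \Pr_p(y_i = m' \mid x_i = k; \theta) / \bar{w}_{m'k}}$.     (3.20)
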
The sampling weights feature in the sample model, but this 
is not an application of classical probability weighting. 
Notice that with this approximation the parameters in the 
population and the sample model are the same. In our 
empirical study we use a similar approximation for the 
sample distribution by categorizing the values of a con- 
tinuous outcome. See Pfeffermann and Sverchkov (1999) 
for other examples. 
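
The computation in Example 9 amounts to reweighting the population cell probabilities by the inverse mean weights and renormalizing within each covariate value. A short Python sketch, with a hypothetical function name and an assumed array layout (rows index the outcome value $m$, columns index the covariate value $k$), is:

```python
import numpy as np

def sample_cell_probs(pop_probs, mean_weights):
    """Sample-model probabilities from population probabilities and cell mean weights.

    pop_probs[m, k]    : Pr_p(y = m | x = k) under the population model
    mean_weights[m, k] : mean sampling weight in cell (m, k)
    """
    pop_probs = np.asarray(pop_probs, dtype=float)
    mean_weights = np.asarray(mean_weights, dtype=float)
    num = pop_probs / mean_weights             # proportional to pi_hat_mk * Pr_p
    return num / num.sum(axis=0, keepdims=True)

pop = np.array([[0.5, 0.2], [0.5, 0.8]])
wbar = np.array([[1.0, 2.0], [3.0, 2.0]])
print(sample_cell_probs(pop, wbar))
```

In the first column the outcome value with the smaller mean weight is over-represented in the sample, while in the second column the weights are equal across outcome values and the population probabilities are returned unchanged.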

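Next consider the estimation of the vector parameter $\theta$
governing the population model. Under mild conditions, $\theta$
is the unique solution of the equations

$W_U(\theta) = \sum_{i \in U} E_p(\delta_i \mid x_i) = 0$;
$\delta_i = (\delta_{i,0}, \delta_{i,1}, \ldots, \delta_{i,k})' = \partial \log f_p(y_i \mid x_i; \theta) / \partial\theta$.     (3.21)
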


Pfeffermann and Sverchkov (2003) consider three different
approaches for estimating $\theta$. The common feature of these
approaches is that the only data used for estimation are the
observations $\{(y_i, x_i, w_i), i \in s\}$, similarly to the PW
estimators and their modifications considered in Section 3.5.
In Section 3.6.2 we consider the use of the 'full likelihood',
which assumes knowledge of the covariates $\{x_j, j \in U\}$,
and possibly also additional design information.

The first approach redefines the parameter equations with
respect to the sample model. Assuming that $E_s(w_i \mid x_i; \theta, \gamma)$
in (3.19) is differentiable with respect to $\theta$, the
sample model parameter equations are
$W_{1s}(\theta) = \sum_{i \in s} E_s \{ [\partial \log f_s(y_i \mid x_i; \theta, \gamma) / \partial\theta] \mid x_i \} = \sum_{i \in s} E_s \{ [\delta_i + \partial \log E_s(w_i \mid x_i; \theta, \gamma) / \partial\theta] \mid x_i \} = 0$.
The vector $\theta$ is estimated under
this approach by solving the equations,

$W_{1s,e}(\theta) = \sum_{i \in s} [\delta_i + \partial \log E_s(w_i \mid x_i; \theta, \gamma) / \partial\theta] = 0$.     (3.22)

The second approach applies the relationship (3.14) to the
parameter equations (3.21). For a random sample from the
sample model, the equations are now $W_{2s}(\theta) = \sum_{i \in s} E_s(q_i \delta_i \mid x_i) = 0$,
where $q_i = w_i / E_s(w_i \mid x_i)$. The vector $\theta$ is
estimated under this approach by solving the equations,

$W_{2s,e}(\theta) = \sum_{i \in s} q_i \delta_i = 0$.     (3.23)

The third approach uses the property that if $\theta$ solves (3.21),
then it solves also the equations, $\tilde{W}_U(\theta) = \sum_{i \in U} E_p(\delta_i) = E_x[\sum_{i \in U} E_p(\delta_i \mid x_i)] = 0$,
where $E_x(\cdot)$ is the expectation
of $x$ (which is viewed as random) with respect to the
population distribution. Hence, by (3.14), for a random
sample from the sample model, the parameter equations are
$W_{3s}(\theta) = \sum_{i \in s} E_s(w_i \delta_i) = 0$, with estimating equations,

$W_{3s,e}(\theta) = \sum_{i \in s} w_i \delta_i = 0$.     (3.24)

Note that the equations (3.24) are the pseudo-likelihood
equations (Remark 7).


Remark 9. The use of the weights $q_i = w_i / E_s(w_i \mid x_i)$ for
population model parameter estimation has been justified
already in Section 3.5 by reference to least-squares estimation.
See the discussion in that section regarding the difference
between the use of the weights $q_i$ and the weights
$w_i$. Pfeffermann and Sverchkov (1999, 2003) illustrate that
estimating $\theta$ by solving the equations (3.23) yields estimators
with lower randomization variance than estimating
$\theta$ by solving the equations (3.24). Notice that under the
assumption of a linear regression model operating in the
population, the solution of (3.24) yields the PW estimator
(3.8), and the solution of (3.23) yields the q-weighted
estimator (3.16).


Remark 10. The use of the sample model for estimation of 
multi-level population models is considered in Pfeffermann, 
Moura and Nascimento-Silva (2006), using the Bayesian
approach. Pfeffermann and Sverchkov (2007) fit multi-level
models for small area estimation under informative sam- 
pling of areas and within the areas, following the frequentist 
approach. 


So far we assumed full response. Next consider the case 
of NMAR nonresponse. In this case the response process 
needs to be modelled as well. By (2.2) and with added 
parameter notation the ‘respondents’ likelihood takes the 
form, 


$$L_r(\theta^*, \gamma^*) = \prod_{i=1}^{r} f(y_i \mid x_i, I_i = 1, R_i = 1; \theta^*, \gamma^*) = \prod_{i=1}^{r} \frac{\Pr(R_i = 1 \mid y_i, x_i, I_i = 1; \gamma^*)\, f_s(y_i \mid x_i; \theta^*)}{\Pr(R_i = 1 \mid x_i, I_i = 1; \theta^*, \gamma^*)}, \tag{3.25}$$

where $\theta^* = (\theta, \gamma)$ represents the parameters of the sample distribution under full response (Equation 3.19), and $\gamma^*$ represents the parameters of the response process. Notice that unlike the sampling probabilities $\pi_i = \Pr(i \in s)$, which are generally known and can be used for estimating the probabilities $\Pr(I_i = 1 \mid y_i, x_i; \gamma)$ as explained before, the response probabilities are generally unknown.

Chang and Kott (2008) propose a method of estimating 
the response probabilities, which uses known totals of 
calibration variables. The authors assume a parametric 
model for the response probabilities that may depend on the 
outcome value, and estimate the unknown parameters of this 
model by regressing the totals of the calibration variables 
against their H-T estimators. The weights used for the H-T 
estimators are the product of the sampling weights and the 
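inverse of the response probabilities under the model. Let $c_i$ define the values of the calibration variables for unit $i$ and denote $\rho(y_i, x_i; \gamma^*) = \Pr(R_i = 1 \mid y_i, x_i, I_i = 1; \gamma^*)$. Chang and Kott (2008) estimate the unknown parameters by setting the nonlinear regression equations,

$$C^U = \sum_{i=1}^{r} \frac{w_i c_i}{\rho(y_i, x_i; \gamma^*)} + \varepsilon,$$

where $C^U = \sum_{i \in U} c_i$ and $\varepsilon$ is a vector of errors. The parameters $\gamma^*$ are estimated by the iterative algorithm

$$\gamma^{*(j+1)} = \gamma^{*(j)} + \left\{\hat{H}(\gamma^{*(j)})' V^{-1}(\gamma^{*(j)}) \hat{H}(\gamma^{*(j)})\right\}^{-1} \hat{H}(\gamma^{*(j)})' V^{-1}(\gamma^{*(j)}) \left[C^U - \sum_{i=1}^{r} \frac{w_i c_i}{\rho(y_i, x_i; \gamma^{*(j)})}\right], \tag{3.26}$$

where $\hat{H}(\gamma^*) = \partial\left[\sum_{i=1}^{r} w_i c_i/\rho(y_i, x_i; \gamma^*)\right]/\partial\gamma^*$ and $V^{-1}(\gamma^{*(j)})$ is the inverse of the estimated quasi-randomization variance of $\sum_{i=1}^{r} w_i c_i/\rho(y_i, x_i; \gamma^*)$, computed at $\gamma^* = \gamma^{*(j)}$.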
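
A minimal sketch of the iteration (3.26) in the simplest possible setting may help fix ideas. It assumes a single calibration variable and a single response parameter, so that the matrix expression reduces to a scalar Newton step and the variance term cancels. The logistic response model, the respondent data and all function names below are hypothetical illustrations and are not taken from Chang and Kott (2008).

```python
import math

def inv_rho(y, gamma):
    """1 / rho(y; gamma) for the assumed model rho = 1 / (1 + exp(-gamma * y))."""
    return 1.0 + math.exp(-gamma * y)

def adjusted_total(gamma, w, c, y):
    """Response-adjusted H-T estimator: sum_i w_i c_i / rho(y_i; gamma)."""
    return sum(wi * ci * inv_rho(yi, gamma) for wi, ci, yi in zip(w, c, y))

def adjusted_total_derivative(gamma, w, c, y):
    """Scalar analogue of H: derivative of the adjusted total with respect to gamma."""
    return sum(-wi * ci * yi * math.exp(-gamma * yi) for wi, ci, yi in zip(w, c, y))

def calibrate(c_total, w, c, y, gamma0=0.0, tol=1e-12, max_iter=100):
    """Iterate gamma <- gamma + (C^U - total(gamma)) / H(gamma) until the step is tiny."""
    gamma = gamma0
    for _ in range(max_iter):
        resid = c_total - adjusted_total(gamma, w, c, y)
        step = resid / adjusted_total_derivative(gamma, w, c, y)
        gamma += step
        if abs(step) < tol:
            break
    return gamma

# Hypothetical respondent data: outcomes, sampling weights, calibration variable c_i = 1.
y = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0]
w = [10.0, 12.0, 8.0, 15.0, 9.0, 11.0]
c = [1.0] * len(y)

# Build a "known" population total that is exactly consistent with gamma = 0.5.
gamma_true = 0.5
c_total = adjusted_total(gamma_true, w, c, y)
gamma_hat = calibrate(c_total, w, c, y)
```

With several calibration variables the residual becomes a vector and the weighting by $V^{-1}$ no longer cancels, so this scalar version only illustrates the structure of the update.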

Chang and Kott (2008) do not assume a model for the 
outcome and their approach is therefore restricted to 
estimation of the model for the response probabilities. 
Pfeffermann and Sikov (2011) use the likelihood (3.25) for 
estimating population models assuming noninformative 
sampling. Maximization of the likelihood is carried out by 
iterating between maximization of the likelihood with
respect to $\theta^*$ for given $\gamma^*$, and the solution of calibration
equations with respect to $\gamma^*$ for given $\theta^*$, using known
totals of calibration variables, similarly to Chang and Kott 
(2008). The ‘given’ parameters are the estimates from the 
previous iteration. The authors show how to estimate the 
distribution of the missing covariates and outcome for a 
nonresponding unit and use this distribution for imputing 
the missing outcomes and hence predicting the finite popu- 
lation total of the outcome variable. 

Estimation of the population model by fitting the sample 
model has some important advantages not shared by the 
other approaches considered in this article. 


1. Once the sample model is specified, it lends itself 
to standard model based inference such as like- 
lihood based methods, Bayesian inference or semi- 
parametric modelling. It is important to emphasize 
in this regard that the goodness of fit of the 
postulated population model can be evaluated by 
testing the goodness of fit of the sample model 
fitted to the observed outcomes, using classical 
model diagnostic techniques. See Krieger and 
Pfeffermann (1997) and Pfeffermann and Sikov 
(2011) for appropriate test statistics with illustra- 
tions. 

2. The sample likelihood provides a coherent way of 
handling NMAR nonresponse when estimating 
population models. Methods based on probability 
weighting require knowledge or good estimators 
of the response probabilities. The use of the full 
likelihood (see below) requires knowledge of the 
covariates of nonsampled units. 

3. Application of this approach permits the use of 
conditional inference, given the sample of re- 
sponding units, for example, conditioning on the 
observed covariates. 

4. The models holding for the observed outcomes and 
the response probabilities define the model holding 
for the missing outcomes of the non-sampled units 
or the nonrespondents, which can be used for
imputation of these outcomes. Methods based on
probability weighting and variants thereof allow 
estimating the population model but under infor- 
mative sampling and NMAR nonresponse, the 
population model cannot be used for prediction or 
imputation of the missing outcomes. See Sverchkov 
and Pfeffermann (2004) and Pfeffermann and Sikov 
(2011) for illustrations. 

5. The use of the sample model enables testing whether 
the sampling process can be ignored. Pfeffermann 
and Sverchkov (2009) review several test statistics 
proposed in the literature for testing the ignorability 
of the sample selection. 


3.6.2 The full likelihood 


Theoretically, a more efficient way of estimating the 
unknown population model parameters is to base the 
likelihood on the joint distribution of the sample data and 
the sample membership indicators. Under full response, the 
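full likelihood is then,

$$L_f(\theta, \gamma; I_U, y_s, x_U) = \prod_{i \in s} \Pr(I_i = 1 \mid y_i, x_i; \gamma)\, f_p(y_i \mid x_i; \theta) \times \prod_{j \notin s} \left[1 - \Pr(I_j = 1 \mid x_j; \theta, \gamma)\right], \tag{3.27}$$

where $I_U = (I_1, \ldots, I_N)'$ is the vector of sample inclusion indicators and $\Pr(I_j = 1 \mid x_j; \theta, \gamma) = \int \Pr(I_j = 1 \mid y_j, x_j; \gamma) f_p(y_j \mid x_j; \theta)\,dy_j$ is the propensity score of unit $j$. The likelihood (3.27) assumes $\Pr(I_U \mid y_U, x_U) = \prod_{i \in U} \Pr(I_i \mid y_i, x_i)$ (Poisson sampling), but it can be generalized to other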
sampling designs. The full likelihood has the advantage of 
accounting for the sampling probabilities of units outside the 
sample, thus utilizing more information, but it requires 
knowledge of the covariates of all the population units. See, 
for example, Gelman, Carlin, Stern and Rubin (2003) and 
Little (2004). Modelling the joint distribution of the 
covariates for units outside the sample and integrating them 
out of the likelihood can be very complicated in practice and 
is formidable when there are many of them. Pfeffermann 
et al. (2006) compare empirically the use of the sample 
likelihood with the use of the full likelihood for multi-level 
models in a Bayesian context. The two approaches yield 
similar results, but this of course may not be the case in 
other applications. 

Another way of defining the full likelihood is by 
application of the Missing Information Principle (MIP, 
Orchard and Woodbury 1972). The basic idea is to express 
the sample score function as the conditional expectation of 
the population score function, given the sample data. 
Following Chambers and Skinner (2003, Chapter 2), define the full-sample likelihood as $L_s(\lambda) = f(\lambda; y_s, x_s, I_U, z_U)$ where, as before, $z_U$ is a known matrix of population values underlying the sample selection and $\lambda$ defines the unknown model parameters. The corresponding full-population likelihood is $L_U(\lambda) = f(\lambda; y_U, x_U, I_U, z_U)$ where $y_U = (y_s, y_{\bar{s}})$ and $x_U = (x_s, x_{\bar{s}})$. The MIP states that,

$$sc_s(\lambda) = (\partial/\partial\lambda)\log[L_s(\lambda)] = E_U\left[(\partial/\partial\lambda)\log L_U(\lambda) \mid y_s, x_s, I_U, z_U\right]. \tag{3.28}$$


Another identity defines the relationship between the 
population likelihood information matrix and the sample 
likelihood information matrix. 

Breckling, Chambers, Dorfman, Tam and Welsh (1994) 
and Chambers et al. (1998) consider applications of the MIP to complex survey data. In particular, Chambers et al. (1998) study the use of the MIP when only limited design information is available and not the full information entailed in $z_U$. The authors show examples where the use of the MIP is more efficient than the use of the sample likelihood $L_s(\theta, \gamma; y_s, x_s)$ defined by (3.19), which only uses the weights $\{w_i, i \in s\}$. The likelihood (3.28) can be extended to account for NMAR nonresponse but the application of this approach then requires knowledge of the population values of the variables explaining the response. The computation of the expectation in the right hand side of (3.28) may not be simple either, depending on the population model.


Remark 11. The use of the MIP method in the simulation set up of Section 3.1 requires knowledge of the covariates and stratification membership for units outside the sample. We did not find a way of applying the method in this case without further assumptions on the joint distribution of the covariates and the design variables.


3.6.3 Empirical likelihood 


In recent years there has been growing interest in the use of
empirical likelihood (EL) methods for analyzing complex
survey data. The EL method as originally proposed by 
Hartley and Rao (1968) in the survey sample context and by 
Owen (1988, 2001) combines the robustness of non- 
parametric methods with the effectiveness of the likelihood 
approach. Two other important advantages of this method 
are that it lends itself very naturally to the use of calibration 
equations and that it enables the construction of confidence 
intervals without the need for variance estimation. 

Consider the model defined by (3.13) where for now we view the covariates as random, and denote $g_i = (y_i, x_i')'$. Under some regularity conditions, the vector parameter $\theta$ is the unique solution of the equation

$$E_p\left\{\frac{\partial m(x; \theta)}{\partial\theta}\,[y - m(x; \theta)]\right\} = 0.$$

Let $p_1, \ldots, p_n$ be a set of probabilities corresponding to the observations $(g_1, \ldots, g_n)$ such that $p_i$ is the ‘jump’ (probability mass) of the population cumulative distribution $F_p(g_i)$ at $g_i$. It is assumed that $F_p$ has its support on the observed values such that

$$\sum_{i=1}^{n} p_i\,\frac{\partial m(x_i; \theta)}{\partial\theta}\,[y_i - m(x_i; \theta)] = 0. \tag{3.29}$$

Assuming independent observations, the EL of $F_p$ is $L(F_p) = \prod_{i=1}^{n} p_i$. Notice that if $p_i$ is a known function of some unknown parameters, $L(F_p)$ coincides with the standard parametric likelihood. The (nonparametric) EL estimators of the probabilities $p_i$ are the solution $\hat{p}_i$ of the maximization problem,

$$\max_{p_1, \ldots, p_n} \prod_{i=1}^{n} p_i \quad \text{s.t.} \quad p_i \ge 0,\ \sum_{i=1}^{n} p_i = 1, \tag{3.30}$$

yielding $\hat{p}_i = 1/n$, $i = 1, \ldots, n$. For the linear regression case, $m(x_i; \theta) = x_i'\beta$ and by substituting $\hat{p}_i$ for $p_i$ in (3.29) and solving the equations we obtain the EL estimator of $\beta$ as $\hat{\beta}_{el} = \hat{\beta}_{ols}$. When finite population means $\bar{C}^U$ of variables $C$ measured in the sample are known, they can be added to the maximization problem (3.30) by adding the calibration constraints $\sum_{i=1}^{n} p_i c_i = \bar{C}^U$. This additional information is expected to enhance the estimation of the $p_i$'s and hence the estimation of the unknown model parameters. See also Remark 12 below.

Suppose now that units are drawn to the sample (or 
respond) with unequal selection probabilities $\pi_i$. In this case it is common to replace the objective empirical likelihood $L(F_p) = \prod_{i=1}^{n} p_i$ by the pseudo empirical likelihood $L_{pel}(F_p) = \prod_{i=1}^{n} p_i^{w_i}$, where, as before, $w_i = 1/\pi_i$. Notice that $\log L_{pel}(F_p) = \sum_{i=1}^{n} w_i \log(p_i)$ is the H-T estimator of $\log L_U(F_p) = \sum_{i=1}^{N} \log p_i$. The pseudo EL estimators of the $p_i$'s solve the maximization problem,

$$\max_{p_1, \ldots, p_n} \sum_{i=1}^{n} w_i \log(p_i) \quad \text{s.t.} \quad p_i \ge 0,\ \sum_{i=1}^{n} p_i = 1. \tag{3.31}$$

See, e.g., Chen and Sitter (1999). It is easy to verify that in the absence of benchmark constraints, the solution of (3.31) is $\hat{p}_i = w_i/\sum_{j=1}^{n} w_j$ and by substituting $\hat{p}_i$ for $p_i$ in (3.29), $\hat{\beta}_{pel} = \hat{\beta}_{pw}$, the PW estimator (3.8).
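
The closed form of the pseudo EL solution can be checked with a Lagrange multiplier. The multiplier $\lambda$ and the function $\mathcal{L}$ are introduced here only for this verification. Setting $w_i = 1$ for all $i$ recovers the solution $1/n$ of the unweighted problem, since maximizing the product of the $p_i$ is equivalent to maximizing the sum of their logarithms.

```latex
\begin{aligned}
\mathcal{L}(p, \lambda) &= \sum_{i=1}^{n} w_i \log p_i - \lambda\Big(\sum_{i=1}^{n} p_i - 1\Big), \\
\frac{\partial \mathcal{L}}{\partial p_i} &= \frac{w_i}{p_i} - \lambda = 0
  \;\Longrightarrow\; p_i = \frac{w_i}{\lambda}, \\
\sum_{i=1}^{n} p_i = 1 &\;\Longrightarrow\; \lambda = \sum_{j=1}^{n} w_j
  \;\Longrightarrow\; p_i = \frac{w_i}{\sum_{j=1}^{n} w_j}.
\end{aligned}
```

The non-negativity constraints are inactive because the objective tends to minus infinity as any $p_i$ approaches zero.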

The empirical likelihoods in (3.30) and (3.31) are with 
respect to the population distribution. Alternatively, one can 
obtain the EL estimator by defining the likelihood with 
respect to the sample distribution $f_s(g_i) = \Pr(I_i = 1 \mid g_i)\,f_p(g_i)/\Pr(I_i = 1)$, where by denoting $\tau_i = \Pr(I_i = 1 \mid g_i)$, $\Pr(I_i = 1) = \sum_{j=1}^{n} p_j \tau_j$. Following Kim (2009) and Chaudhuri, Handcock and Rendall (2010), the EL estimators of the probabilities $p_i$ are obtained now as the solution of the maximization problem

$$\max_{p_1, \ldots, p_n} \left[\sum_{i=1}^{n} \log(p_i \tau_i) - n \log \sum_{i=1}^{n} p_i \tau_i\right] \quad \text{s.t.} \quad p_i \ge 0,\ \sum_{i=1}^{n} p_i = 1. \tag{3.32}$$

The solution of (3.32) is $\hat{p}_i = \tau_i^{-1}/\sum_{j=1}^{n} \tau_j^{-1}$ and by substituting in (3.29),

$$\hat{\beta}_{sel} = \left[\sum_{i=1}^{n} \tau_i^{-1} x_i x_i'\right]^{-1} \sum_{i=1}^{n} \tau_i^{-1} x_i y_i. \tag{3.33}$$

The estimator $\hat{\beta}_{sel}$ has the same form as the PW estimator $\hat{\beta}_{pw}$ in (3.8), but with the weights $\tau_i^{-1} = 1/\Pr(I_i = 1 \mid y_i, x_i)$ instead of the sampling weights $w_i$. In practice, one has to replace the probabilities $\tau_i$ by sample estimates $\hat{\tau}_i$. See Section 4.
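
The stated solution of the sample-distribution EL problem can be verified in a few lines. In the sketch below, $S$, $\lambda$ and $\mathcal{L}$ are auxiliary symbols introduced only for the derivation, with $S = \sum_j p_j \tau_j$; the term $\sum_i \log \tau_i$ does not depend on the $p_i$ and is dropped.

```latex
\begin{aligned}
\mathcal{L}(p, \lambda) &= \sum_{i=1}^{n} \log p_i - n \log S - \lambda\Big(\sum_{i=1}^{n} p_i - 1\Big),
  \qquad S = \sum_{j=1}^{n} p_j \tau_j, \\
\frac{\partial \mathcal{L}}{\partial p_i} &= \frac{1}{p_i} - \frac{n \tau_i}{S} - \lambda = 0. \\
\text{Multiplying by } p_i \text{ and summing over } i:\quad
  & n - n - \lambda = 0 \;\Longrightarrow\; \lambda = 0
  \;\Longrightarrow\; p_i = \frac{S}{n \tau_i}, \\
\sum_{i=1}^{n} p_i = 1 &\;\Longrightarrow\; S = \frac{n}{\sum_{j=1}^{n} \tau_j^{-1}}
  \;\Longrightarrow\; p_i = \frac{\tau_i^{-1}}{\sum_{j=1}^{n} \tau_j^{-1}}.
\end{aligned}
```

The multiplier vanishes because the objective is unchanged when all the $p_i$ are multiplied by a common constant, so the sum constraint only fixes the normalization.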


Remark 12. The following possible enhancement to the estimation of the probabilities $p_i$ was proposed to me by Dr. Jae Kim in a private communication. Assuming as before that $\Pr(i \in s \mid \pi_i, y_i, x_i) = \pi_i$, it follows that $\tau_i = \Pr(i \in s \mid y_i, x_i) = E_p(\pi_i \mid y_i, x_i)$ and hence that $E_p[(\pi_i - \tau_i) \mid y_i, x_i] = 0$. This suggests adding calibration constraints of the form

$$\sum_{i=1}^{n} p_i(\pi_i - \tau_i)\,k(y_i, x_i) = 0 \tag{3.34}$$

to enhance the estimation of the probabilities $\{p_i\}$ in (3.31), where $k(y_i, x_i) = k(g_i)$ is some function of the observed outcome and covariates. Examples of plausible functions for the case of a single covariate $x$ are $k(g_i) = y_i x_i$, $k(g_i) = y_i/x_i$, etc. The notable feature of the constraints (3.34) is that they do not require knowledge of population quantities like means of calibration variables, as is often assumed when advocating the EL approach for sample survey estimation. Clearly, when means $\bar{C}^U$ of calibration variables are known, constraints of the form $\sum_{i=1}^{n} p_i c_i = \bar{C}^U$ may be added as well. See also Remark 14.


4. Empirical study 


In this section I report the results of a simulation study 
aimed at assessing and comparing the performance of the 
methods discussed in Section 3. The simulation set up is 
described in Section 3.1 and we use $H = 5$ strata. The target parameters are the regression coefficients $\beta' = (\beta_0, \beta_1) = (2, 1)$ of the population expectation (3.1). The simulation study consists of generating 2,000 populations and samples (one sample from each population) and computing the estimators, variance estimators and confidence intervals listed below for each sample. The population size is 5,000 with approximate strata sizes $N_h$ = 363, 554, 842, 1,278, 1,963. (The strata sizes are random.) The sample size is $n = 300$ with $n_h = 60$ sampled units in each stratum. The sampling fractions are therefore highly variable across the strata.

We generated population values of a single discrete covariate $x$ by first generating observations $\tilde{x}_i$ from a Gamma distribution with mean 2 and variance 4, and then defining $x_i$ to be the nearest integer to $\tilde{x}_i$ if $\tilde{x}_i < 5$ and $x_i = 5$ otherwise. The covariates are therefore $\mathbf{x}_i = (1, x_i)'$, with $x_i = 0, 1, \ldots, 5$. The population covariates were generated once and held fixed for all the populations.
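
The covariate generation just described can be sketched as follows. A Gamma distribution with mean 2 and variance 4 corresponds to shape 1 and scale 2; the seed, the function name and the use of Python's standard generator are assumptions made for illustration and are not part of the original study.

```python
import random

def generate_covariate(n_units, seed=2011):
    """Generate the discrete covariate x_i in {0, 1, ..., 5} for n_units population units."""
    rng = random.Random(seed)
    shape, scale = 1.0, 2.0  # mean = shape * scale = 2, variance = shape * scale**2 = 4
    x = []
    for _ in range(n_units):
        x_tilde = rng.gammavariate(shape, scale)
        # Nearest integer when x_tilde < 5 (ties have probability zero), capped at 5 otherwise.
        x.append(int(x_tilde + 0.5) if x_tilde < 5 else 5)
    return x

x_pop = generate_covariate(5000)
# Frequency of each covariate value in the generated population.
counts = {value: x_pop.count(value) for value in range(6)}
```

Holding `x_pop` fixed and regenerating only the outcomes would mimic the design in which the population covariates are generated once for all populations.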

Figure 1 shows the population and sample pdfs of the 
outcomes y! for x= 2.5. 4.5, 

As can be seen, the population and sample pdfs differ, 
indicating the informativeness of the sampling process. 
Notice also that the population pdf is not normal because 
the random coefficients are not normal.

We study the performance of the various methods in 
terms of bias, variance, variance estimation, and confidence 
interval coverage. We assume for all the methods that the only available information is the observed outcomes and covariates $(y_{s_h}, x_{s_h})$ for every stratum $h$, the sample selection probabilities and the true strata sizes $N_h$. I believe that this is the practice in most real life applications.


4.1 Estimators considered 


4.1.1 The OLS estimator $\hat{\beta}_{ols}$. The use of this estimator ignores the sampling process.


4.1.2 The estimator proposed by Feder (2011, see Section 3.2). Application of this approach is in four steps: i) fit a linear model with constant residual variance in each stratum; ii) impute the missing covariate values for the non-sampled units by sampling with replacement $(N_h - n_h)$ values from the $n_h$ observed values in stratum $h$ with probabilities $p_{hi} = (w_{hi} - 1)/\sum_{k=1}^{n_h}(w_{hk} - 1)$ on each draw, where the $w_{hi}$'s are the sampling weights when sampling from stratum $h$; iii) impute the missing y-values in each stratum by generating observations at random from the model fitted in Step i); iv) fit the linear regression model of $y$ on $x$ by using all the population data, with the missing values for the non-sampled units replaced by the imputed values. We denote the resulting estimator by $\hat{\beta}_F$.
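
Step ii) above can be illustrated with a short sketch. The stratum data, the weights and the function name are hypothetical; the weights are chosen to sum to the stratum size so that the number of imputed values equals the sum of the terms $(w_{hi} - 1)$.

```python
import random

def impute_covariates(x_obs, w_obs, n_missing, rng):
    """Draw n_missing covariate values with replacement from the observed ones."""
    excess = [w - 1.0 for w in w_obs]        # w_hi - 1 for each sampled unit
    total = sum(excess)                      # sum over k of (w_hk - 1)
    probs = [e / total for e in excess]      # selection probability p_hi on each draw
    draws = rng.choices(x_obs, weights=probs, k=n_missing)
    return probs, draws

rng = random.Random(0)
x_obs = [0, 1, 1, 2, 3]                      # observed covariates in one stratum (hypothetical)
w_obs = [6.0, 6.0, 11.0, 11.0, 21.0]         # hypothetical sampling weights, summing to N_h
N_h, n_h = 55, len(x_obs)
probs, draws = impute_covariates(x_obs, w_obs, N_h - n_h, rng)
```

A unit with weight $w_{hi}$ represents itself and $w_{hi} - 1$ non-sampled units, which is one way to read the choice of selection probabilities; a unit with weight 1 is never drawn.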


4.1.3 The PW estimator $\hat{\beta}_{pw}$ (Equation 3.8).


4.1.4 The estimator $\hat{\beta}_{mg}$ proposed by Magee (1998, see Section 3.5). In our application we define $a_i(\alpha) = (x_i + 0.1)^{\alpha}$ and search for the optimal power $\alpha$ in the range $[-2, 2]$ minimizing the determinant of the asymptotic variance estimator (3.12).


4.1.5 The estimator $\hat{\beta}_q$ defined by (3.16). For the present study we do not assume any parametric model for the expectation $E_s(w_i \mid x_i)$ in the denominator of $q_i$ and estimate $\hat{E}_s(w_i \mid x_i) = \bar{w}_s(x_i)$, the mean of the observed sampling weights for units with $x = x_i$.

Figure 1 Population pdf (solid line) and sample pdf (dashed line) of y|x


4.1.6 The modified q-weighted estimator $\hat{\beta}_{mg-q}$ defined by (3.17). The weights $q_i$ are obtained as in 4.1.5 and the functions $a_i(\alpha)$ as in 4.1.4.
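
The nonparametric q-weights of 4.1.5 and the resulting weighted regression can be sketched as below for a single covariate. The data are hypothetical, and the closed-form 2 x 2 solution is written out only to keep the example self-contained.

```python
from collections import defaultdict

def q_weights(x, w):
    """q_i = w_i divided by the mean sampling weight of the units sharing the same x value."""
    sums, counts = defaultdict(float), defaultdict(int)
    for xi, wi in zip(x, w):
        sums[xi] += wi
        counts[xi] += 1
    return [wi / (sums[xi] / counts[xi]) for xi, wi in zip(x, w)]

def weighted_ls(x, y, weights):
    """Solve sum_i weight_i * (1, x_i)' * (y_i - b0 - b1 * x_i) = 0 for (b0, b1)."""
    s0 = sum(weights)
    s1 = sum(a * xi for a, xi in zip(weights, x))
    s2 = sum(a * xi * xi for a, xi in zip(weights, x))
    t0 = sum(a * yi for a, yi in zip(weights, y))
    t1 = sum(a * xi * yi for a, xi, yi in zip(weights, x, y))
    det = s0 * s2 - s1 * s1
    return (s2 * t0 - s1 * t1) / det, (s0 * t1 - s1 * t0) / det

# Hypothetical sample: covariate, sampling weights and outcomes.
x = [0, 0, 1, 1, 2, 2]
w = [2.0, 4.0, 3.0, 3.0, 5.0, 10.0]
y = [2.1, 1.8, 3.2, 2.9, 4.3, 3.8]

q = q_weights(x, w)
beta_q = weighted_ls(x, y, q)    # q-weighted estimator
beta_pw = weighted_ls(x, y, w)   # probability-weighted estimator, for comparison
```

By construction the q-weights average to one within each covariate value, so they retain only the variation of the sampling weights that is not explained by $x$.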


4.1.7 Estimators derived by maximization of the sample likelihood (3.19). The use of this approach requires specifying the population pdf and the expectation $E_s(w_i \mid y_i, x_i)$. The unknown population model parameters are $\theta' = (\beta', \sigma^2)$ and we assume $f_p(y_i \mid x_i; \theta) = N(x_i'\beta, \sigma^2)$, which as noted before and illustrated in Figure 1 is not the correct pdf since the random coefficients are not normal (see Section 3.1). We estimated $E_s(w_i \mid y_i, x_i; \gamma)$ nonparametrically and set up the likelihood as follows:

Let $s_i$ define the sample of units with $x = x_i$, of size $m_i$. We first divided the sample into $c(x_i)$ homogeneous clusters based on the ascending values of the outcome $y$ using the R function “hclust”. The $c(x_i)$'s are between 1 and 7, depending on the sample size $m_i$ (one cluster if $m_i \le 10$, 2 clusters if $10 < m_i \le 20$, ..., 7 clusters if $m_i \ge 70$). Denote by $b_{i,k}$ the midpoint between the highest y-value in cluster $k$ and the lowest y-value in cluster $(k+1)$, $k = 1, \ldots, c(x_i) - 1$, and define $b_{i,0} = -\infty$, $b_{i,c(x_i)} = \infty$. For $b_{i,k-1} < y_i \le b_{i,k}$, we estimated $E_s(w_i \mid y_i, x_i)$ by the mean $\bar{w}_s(y_i, x_i)$ of the sampling weights of units with y-values in the same interval. Substituting $\hat{E}_s(w_i \mid y_i, x_i) = \bar{w}_s(y_i, x_i)$ in (3.19) defines the sample likelihood used for the present simulation study as,

$$L_s(\theta; y_s, x_s) = \prod_{i=1}^{n} \frac{f_p(y_i \mid x_i; \theta)/\bar{w}_s(y_i, x_i)}{\sum_{k=1}^{c(x_i)} [F(b_{i,k}) - F(b_{i,k-1})]/\bar{w}_{s,k}(x_i)}, \tag{4.1}$$

where $F(b_{i,k}) = \int_{-\infty}^{b_{i,k}} f_p(y \mid x_i; \theta)\,dy$ (the CDF of the assumed normal pdf) and $\bar{w}_{s,k}(x_i)$ is the mean of the sampling weights of the units in $s_i$ with y-values in the $k$-th interval.

The approximation (4.1) is similar to the approximation (3.20) proposed for the case where both $x$ and $y$ are discrete.


Remark 13. In order to facilitate the numerical optimizations used for the computation of the estimators $\hat{\beta}_{mg}$, $\hat{\beta}_{mg-q}$ and the maximum likelihood estimators in (4.1), we transformed the minimization problem $\min\{f(\theta): \theta \in (a, b)\}$ to $\min\{f[g(\eta)]: \eta \in (-\infty, \infty)\}$ with the function $g(\eta)$ defined as $g(\eta) = [(b - a)\tan^{-1}(\eta)]/\pi + 0.5(a + b)$. Notice that every $\theta \in (a, b)$ has an image $\eta \in R$ with $g(\eta) = \theta$, and $\arg\min\{f(\theta): \theta \in (a, b)\} = g(\eta_0)$ where $\eta_0 = \arg\min f[g(\eta)]$.
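
The reparametrization of Remark 13 can be sketched as follows. The objective function, the interval and the crude derivative-free search are hypothetical stand-ins chosen to keep the example self-contained; they are not the optimizer used in the study.

```python
import math

def g(eta, a, b):
    """Map the real line onto (a, b): g(eta) = (b - a) * atan(eta) / pi + (a + b) / 2."""
    return (b - a) * math.atan(eta) / math.pi + 0.5 * (a + b)

def g_inv(theta, a, b):
    """Inverse map from (a, b) back to the real line."""
    return math.tan(math.pi * (theta - 0.5 * (a + b)) / (b - a))

def minimize_unconstrained(h, eta0=0.0, step=1.0, tol=1e-10, max_iter=10000):
    """Very simple step-halving search on the real line (illustrative only)."""
    eta, h_eta = eta0, h(eta0)
    for _ in range(max_iter):
        if step < tol:
            break
        improved = False
        for cand in (eta + step, eta - step):
            h_cand = h(cand)
            if h_cand < h_eta:
                eta, h_eta, improved = cand, h_cand, True
                break
        if not improved:
            step /= 2.0
    return eta

# Hypothetical constrained problem: minimize (theta - 1.3)^2 over theta in (0, 2).
a, b = 0.0, 2.0
f = lambda theta: (theta - 1.3) ** 2
eta_hat = minimize_unconstrained(lambda eta: f(g(eta, a, b)))
theta_hat = g(eta_hat, a, b)
```

Because `g` is strictly increasing, every iterate maps back into the open interval, so the search never evaluates the objective outside the admissible range.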


We used the R function nlm for the numerical optimization, with the PW estimates as starting values. To prevent numerical overflows of the optimized function by evaluation of exponentials of large numbers, the maximization was limited to the intervals $\{\min[0.5\hat{\beta}_{pw}, \hat{\beta}_{pw} - 3\,s.e.(\hat{\beta}_{pw})], \max[1.5\hat{\beta}_{pw}, \hat{\beta}_{pw} + 3\,s.e.(\hat{\beta}_{pw})]\}$ for $\beta_k$, and $[0.5\hat{\sigma}_{pw}, 1.5\hat{\sigma}_{pw}]$ for $\sigma$.


4.1.8 The empirical likelihood estimator $\hat{\beta}_{sel}$ defined by (3.33). The computation of this estimator requires estimating the probabilities $\tau_i = \Pr(I_i = 1 \mid y_i, x_i) = 1/E_s(w_i \mid y_i, x_i)$, and we use the estimator $\hat{E}_s(w_i \mid y_i, x_i) = \bar{w}_s(y_i, x_i)$ used for defining the likelihood (4.1), such that $\hat{\tau}_i = 1/\bar{w}_s(y_i, x_i)$.


4.2 Variance estimation 


We applied three approaches for variance estimation. 
The first approach estimates the randomization variance, the 
second approach estimates the variance under the sample 
model, while the third approach uses the nonparametric 
bootstrap method, which likewise estimates the variance 
under the sample model. 

Consider first the estimators defined by 4.1.1, 4.1.3 - 4.1.6 and 4.1.8 in Section 4.1. All these estimators can be written in the generic form,

$$\hat{\beta}_* = \left[\sum_{i=1}^{n} w_i t_i x_i x_i'\right]^{-1} \sum_{i=1}^{n} w_i t_i x_i y_i = [X_s' W_s T_s X_s]^{-1} X_s' W_s T_s y_s, \tag{4.2}$$

where $X_s' = [x_1, \ldots, x_n]$, $W_s = \mathrm{diag}[w_1, \ldots, w_n]$ is the diagonal matrix with the sampling weights on the main diagonal and $T_s = \mathrm{diag}[t_1, \ldots, t_n]$, with the $t_i$'s defined by the estimators. For $\hat{\beta}_{ols}$, $t_i = 1/w_i$, for $\hat{\beta}_q$, $t_i = 1/\bar{w}_s(x_i)$, and so forth. The randomization variance of these estimators is estimated as,

$$\widehat{Var}_R(\hat{\beta}_*) = [X_s' W_s T_s X_s]^{-1}\, \widehat{Var}_R\left[\sum_{i=1}^{n} w_i t_i x_i e_i\right] [X_s' W_s T_s X_s]^{-1}, \tag{4.3}$$

where $e_i = (y_i - x_i'B)$ and $B$ is the census estimator. Using the double index $(hj)$ to define the $j$th unit in the sample $s_h$ of size $n_h$ drawn from stratum $h$, we estimated

$$\widehat{Var}_R\left[\sum_{i=1}^{n} w_i t_i x_i e_i\right] = \widehat{Var}_R\left[\sum_{h=1}^{H} \sum_{j=1}^{n_h} w_{hj} t_{hj} x_{hj} e_{hj}\right] = \sum_{h=1}^{H} \frac{n_h}{n_h - 1} \sum_{j=1}^{n_h} (w_{hj} t_{hj} x_{hj} \hat{e}_{hj} - \bar{e}_h)(w_{hj} t_{hj} x_{hj} \hat{e}_{hj} - \bar{e}_h)', \tag{4.4}$$

where $\hat{e}_{hj} = (y_{hj} - x_{hj}'\hat{\beta}_*)$ and $\bar{e}_h = \frac{1}{n_h}\sum_{j=1}^{n_h} w_{hj} t_{hj} x_{hj} \hat{e}_{hj}$, assuming with replacement sampling within the strata.
A variance estimator under the sample model which accounts for possible heteroscedasticity is obtained as,

$$\widehat{Var}_{SM}(\hat{\beta}_*) = [X_s' W_s T_s X_s]^{-1} \left[\sum_{i=1}^{n} w_i^2 t_i^2 \hat{e}_i^2 x_i x_i'\right] [X_s' W_s T_s X_s]^{-1}, \tag{4.5}$$

where $\hat{e}_i = (y_i - x_i'\hat{\beta}_*)$. Randomization and sample model variance estimators for the estimator in 4.1.2 are developed by Feder (2011). For the maximum likelihood estimator under the sample model with the likelihood defined by (4.1) we only estimate the variance under the sample model using the inverse information matrix.

Finally, bootstrap variance estimators for all the esti- 
mators are obtained by sampling with replacement n units 
from the original sample and re-estimating each of the 
estimators using the same computations as for the original 
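sample. Repeating the same process independently $B$ times, the bootstrap variance estimator is,

$$\widehat{Var}_{BS}(\hat{\beta}) = \frac{1}{B} \sum_{b=1}^{B} (\hat{\beta}^{(b)} - \bar{\beta}_{BS})(\hat{\beta}^{(b)} - \bar{\beta}_{BS})', \qquad \bar{\beta}_{BS} = \frac{1}{B} \sum_{b=1}^{B} \hat{\beta}^{(b)}, \tag{4.6}$$

where $\hat{\beta}$ represents any of the estimators defined by 4.1.1 - 4.1.8 and $\hat{\beta}^{(b)}$ is the corresponding estimator computed for bootstrap sample $b$, $b = 1, \ldots, B$.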
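
The bootstrap variance computation can be sketched as follows for a scalar estimator. The OLS slope is used only as a convenient example estimator, and the data, seed and function names are hypothetical; the divisor is the number of bootstrap replications.

```python
import random

def ols_slope(sample):
    """Ordinary least-squares slope for a list of (x, y) pairs."""
    n = len(sample)
    mean_x = sum(x for x, _ in sample) / n
    mean_y = sum(y for _, y in sample) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in sample)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in sample)
    return sxy / sxx

def bootstrap_variance(sample, estimator, n_boot, rng):
    """Resample n units with replacement n_boot times and re-estimate each time."""
    n = len(sample)
    reps = []
    for _ in range(n_boot):
        resample = [sample[rng.randrange(n)] for _ in range(n)]
        reps.append(estimator(resample))
    mean_rep = sum(reps) / n_boot
    var_bs = sum((r - mean_rep) ** 2 for r in reps) / n_boot
    return var_bs, reps

rng = random.Random(1)
# Hypothetical sample of 60 units with x in {0, ..., 5} and a linear mean 2 + x.
sample = [(i % 6, 2.0 + (i % 6) + rng.gauss(0.0, 1.0)) for i in range(60)]
var_bs, reps = bootstrap_variance(sample, ols_slope, 300, rng)
```

Passing a different `estimator` function re-runs the whole estimation on each resample, which is what lets the bootstrap capture steps such as weight estimation that analytic variance formulas may ignore.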


4.3 Computation of confidence intervals 


We consider two approaches of $(1 - \alpha)$ level confidence interval (C.I.) computation. The first approach is the standard C.I.,

$$\hat{\beta}_k \pm Z_{1-\alpha/2}\,\widehat{s.e.}(\hat{\beta}_k), \quad k = 0, 1,$$

where $\hat{\beta}_k$ stands for any of the estimators considered and $\widehat{s.e.}(\hat{\beta}_k)$ is the corresponding estimator of the standard error as obtained by one of the methods listed before. The second, “basic bootstrap” approach uses the quantiles $bs(k, \alpha)$ of the bootstrap estimators $\hat{\beta}_k^{(b)}$ to compute the C.I.

$$\left[2\hat{\beta}_k - bs(k, 1 - \alpha/2),\ 2\hat{\beta}_k - bs(k, \alpha/2)\right].$$

We tried also the use of the “studentized bootstrap method” but the coverage rates were not better with any of the estimators $\hat{\beta}_k$. See Remark 15 below.
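
The two interval constructions can be sketched as below. The quantile rule, the replicate values and the function names are illustrative assumptions; other empirical quantile definitions would serve equally well.

```python
import math
from statistics import NormalDist

def empirical_quantile(values, prob):
    """Order-statistic quantile: the ceil(prob * n)-th smallest value."""
    ordered = sorted(values)
    idx = min(len(ordered) - 1, max(0, math.ceil(prob * len(ordered)) - 1))
    return ordered[idx]

def standard_ci(estimate, std_err, alpha):
    """Normal-theory interval: estimate +/- z_{1 - alpha/2} * s.e."""
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    return estimate - z * std_err, estimate + z * std_err

def basic_bootstrap_ci(estimate, reps, alpha):
    """Basic bootstrap interval: [2*est - q_{1 - alpha/2}, 2*est - q_{alpha/2}]."""
    lower_q = empirical_quantile(reps, alpha / 2.0)
    upper_q = empirical_quantile(reps, 1.0 - alpha / 2.0)
    return 2.0 * estimate - upper_q, 2.0 * estimate - lower_q

# Hypothetical bootstrap replicates spread evenly around an estimate of 1.0.
reps = [1.0 + 0.01 * (k - 50) for k in range(101)]
std_lo, std_hi = standard_ci(1.0, 0.1, 0.05)
bs_lo, bs_hi = basic_bootstrap_ci(1.0, reps, 0.10)
```

The basic bootstrap interval reflects the bootstrap quantiles around the point estimate, so for a skewed bootstrap distribution it is not symmetric about the estimate, unlike the standard interval.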


4.4 Simulation results 


Table 1 shows the empirical means of the estimates listed 
in Section 4.1 over the 2,000 populations and samples and 
the corresponding empirical standard errors (S.E.). Also 
shown are the square roots of the means of the variance 
estimates as obtained when estimating the randomization 
variance (“Ran.”) and when estimating the variance under 
the sample model (“S.M.”). Because of computing time 
limitations, the results for the bootstrap variance estimators 
(“BS”) are based on 300 bootstrap samples drawn from each 
of 500 original samples. These numbers of original and 
bootstrap samples were found to produce stable variance 
estimators. 

As expected, given the use of an informative sampling 
scheme, the OLS estimator has a relatively large bias of 
12% (5%) when estimating the intercept (slope). All the other estimators are virtually unbiased, except for $\hat{\beta}_{mle}$ which has bias of 2% and 1.5%. The almost unbiasedness of the EL estimator $\hat{\beta}_{sel}$ is particularly encouraging given the somewhat crude nonparametric estimation of the probabilities $\tau_i = \Pr(i \in s \mid y_i, x_i)$. Notice also that this estimator has similar empirical S.E. to those of the PW estimator. The small (but statistically significant) bias of $\hat{\beta}_{mle}$ is explained by the fact that we assume a normal distribution under the population model, which as noted and illustrated before is incorrect.

Regarding precision, the OLS estimator has the smallest S.E. but $\hat{\beta}_F$ has almost the same S.E. (and is unbiased). This is explained by the fact that this estimator uses additional stratification information not used by the other estimators. Note that $\hat{\beta}_{mg}$, $\hat{\beta}_{mg-q}$ and particularly $\hat{\beta}_q$ outperform $\hat{\beta}_{pw}$ but $\hat{\beta}_{mg-q}$ does not improve over $\hat{\beta}_q$.


Remark 14. Following my presentation of this paper at the 2011 Statistics Canada symposium, Jean-François Beaumont suggested to replace the weights $\hat{\tau}_i^{-1}$ used for the computation of $\hat{\beta}_{sel}$ by the weights $\hat{\tau}_i^{-1}/\hat{E}_s(\hat{\tau}_i^{-1} \mid x_i)$, so as to account for the net sampling effects on the conditional pdf $f(y \mid x)$, similarly to the use of the q-weights in $\hat{\beta}_q$. Notice that whereas the sampling weights $w_i$ may depend on $y$, $x$ and possibly other variables, the weights $\tau_i^{-1}$ only depend on $y$ and $x$. Application of this idea did not affect the bias but the empirical S.E. of the modified estimators are 0.151 and 0.053, smaller than the S.E. of $\hat{\beta}_{sel}$ and similar to the S.E. of $\hat{\beta}_q$.

Looking at the performance of the variance estimators, 
the first remarkable outcome is that the randomization and 
sample model variance estimators (Equations 4.4 and 4.5) 
are very similar for every estimator of the regression coef- 
ficients, even though they are computed very differently. 
For $\hat{\beta}_{ols}$, $\hat{\beta}_{pw}$ and $\hat{\beta}_q$ the variance estimators are almost unbiased but for the other estimators the variance estimators under-estimate the true variance. This is explained by the fact that these variance estimators ignore some of the operations involved in the computation of the estimated regression coefficients. Thus, in the case of the estimators $\hat{\beta}_{mg}$ and $\hat{\beta}_{mg-q}$, the variance estimators do not account for the choice of the optimal weights $a_i(\alpha)$, in the case of $\hat{\beta}_F$ the variance estimator does not account for the random imputation of the vectors $(y_i, x_i)$ for $i \in U - s$, and in the case of $\hat{\beta}_{mle}$ and $\hat{\beta}_{sel}$ the variance estimators do not account for the estimation of the probabilities $\Pr(i \in s \mid y_i, x_i)$. This under-estimation of the variance is corrected in almost all cases by use of the bootstrap method, see, in particular, the estimation of the variances of $\hat{\beta}_F$, $\hat{\beta}_{mle}$ and $\hat{\beta}_{sel}$.

Figure 2 shows the empirical coverage rates of $(1 - \alpha)$-level confidence intervals (C.I.) for $\alpha$ = 0.10, 0.05, 0.01, as obtained when applying the standard C.I. with the standard errors estimated by the BS method, and when using the basic bootstrap method. The figures in the horizontal axis are the nominal levels.

The coverage rates are almost always below the nominal levels but the under-coverage in the case of the standard C.I. is generally less than 4%. The two exceptions are when basing the confidence intervals on the OLS estimators (large under-coverage) and the mle estimator of the slope (under-coverage of 7% at the 90% nominal level), which is explained by the bias of these estimators. The under-coverage percentages when using the basic bootstrap method are generally slightly larger, except for the under-coverage of the C.I. for the intercept based on $\hat{\beta}_{sel}$, which is more pronounced.


Table 1 Means of estimates, empirical standard errors (S.E.) and square roots of means of variance estimates. Population model: $E_p(y_i) = 2 + 1 \times x_i$, $Var_p(y_i) = (1 + 0.2x_i)^2 V_i + 1$

Method      Intercept (β_0)                                  Slope (β_1)
            Mean         Emp.                                Mean     Emp.
            Est.         S.E.     Ran.     S.M.     BS       Est.     S.E.     Ran.     S.M.     BS
β_ols       [illegible]  0.133    0.135    0.139    0.140    1.046    0.048    0.048    0.049    0.049
β_F         2.006        0.133    0.126    0.126    0.135    0.999    0.051    0.041    0.041    0.052
β_pw        2.008        0.166    0.167    0.169    0.157    0.998    0.059    0.055    0.055    0.056
β_mg        2.017        0.158    0.154    0.156    0.154    0.995    0.056    0.050    0.050    0.055
β_q         2.011        0.153    0.157    0.159    0.147    0.999    0.054    0.051    0.051    0.052
β_mg-q      2.020        0.156    0.152    0.154    0.153    0.996    0.055    0.049    0.050    0.054
β_mle       1.960        0.159    -        0.143    0.152    1.026    0.054    -        0.046    0.053
β_sel       2.031        0.164    0.143    0.143    0.159    0.995    0.058    0.049    0.049    0.057

Figure 2 Coverage rates of standard (left) and BS (right) confidence intervals


Remark 15. We computed also the standard C.I. with the S.E. estimated under the randomization distribution (Equation 4.4) and under the sample model (Equation 4.5), but except in the case of the estimators $\hat{\beta}_{pw}$ and $\hat{\beta}_q$, the under-coverage of these intervals was somewhat higher than the under-coverage in Figure 2 because of the under-estimation of the true S.E. by these S.E. estimators discussed before. The same phenomenon was observed when using the “studentized bootstrap method” with these S.E. estimates, which again can be explained by the underestimation of the true S.E.’s. The use of more advanced bootstrap C.I. such as double-bootstrap may correct this under-coverage.


5. Concluding remarks


In this article I discuss alternative procedures proposed in the literature to account for informative sampling and NMAR nonresponse when modeling survey data. The empirical study is restricted so far to the case of linear
regression and single-stage sampling, and an obvious
extension would be to consider other models and cluster 
sampling. The present study illustrates the unbiasedness or 
approximate unbiasedness of all the point estimators 
considered, but the standard variance estimators under- 
estimate the true variances in most cases since they fail to 
account for the extra operations involved in computing the 
corresponding point estimators. The bootstrap variance 
estimators produce much better variance estimators in these 
cases. The confidence intervals applied in the present study 
yield small under-coverage in most cases, but they should 
be improved, possibly by use of more advanced bootstrap 
techniques. Another important extension mentioned in the 
paper, which we have not investigated empirically so far is 
to incorporate sample based calibration constraints in the 
empirical likelihood method when based on the sample 
distribution. 

We plan to apply the various methods to several real data 
sets. This would require the development of diagnostic 
procedures that would allow comparing the performance of 
the methods since unlike in a simulation study, the true 
distributions and model parameters are seldom known in 
real applications.


A Bayesian analysis of small area probabilities under a constraint


Balgobin Nandram and Hasanjan Sayit ¹


Abstract 


In many sample surveys there are items requesting binary response (e.g., obese, not obese) from a number of small areas. 
Inference is required about the probability for a positive response (e.g., obese) in each area, the probability being the same 
for all individuals in each area and different across areas. Because of the sparseness of the data within areas, direct 
estimators are not reliable, and there is a need to use data from other areas to improve inference for a specific area. 
Essentially, a priori the areas are assumed to be similar, and a hierarchical Bayesian model, the standard beta-binomial 
model, is a natural choice. The innovation is that a practitioner may have much-needed additional prior information about a 
linear combination of the probabilities. For example, a weighted average of the probabilities is a parameter, and information 
can be elicited about this parameter, thereby making the Bayesian paradigm appropriate. We have modified the standard 
beta-binomial model for small areas to incorporate the prior information on the linear combination of the probabilities, 
which we call a constraint. Thus, there are three cases. The practitioner (a) does not specify a constraint, (b) specifies a 
constraint and the parameter completely, and (c) specifies a constraint and information which can be used to construct a 
prior distribution for the parameter. The griddy Gibbs sampler is used to fit the models. To illustrate our method, we use an 
example on obesity of children in the National Health and Nutrition Examination Survey in which the small areas are 
formed by crossing school (middle, high), ethnicity (white, black, Mexican) and gender (male, female). We use a simulation 
study to assess some of the statistical features of our method. We show that, relative to (a), the gain in precision is 
larger under (b) than under (c). 


Key Words: Accept-reject algorithm; Binomial distribution; Generalized beta distribution; Griddy Gibbs sampler; Simulation. 


1. Introduction 


It is a standard practice to use models to “borrow 
strength” in small area estimation (Rao 2003). Owing to the 
sparseness of the data in each area, direct estimates for small 
areas are typically not reliable. Our procedure allows a 
practitioner to incorporate prior information about a linear 
combination of binomial probabilities, one for each area. 
This is a constraint that we include as a weighted average of 
the area probabilities in the standard beta-binomial model. 
The weighted average can be assumed known or unknown. 
In the case when this value is unknown, we consider the 
scenario when there is some information which can be 
elicited from an expert in the form of prior distribution. This 
is different from standard practice in design based survey 
sampling in which auxiliary information is incorporated as 
in ratio and regression estimators (Cochran 1977). When the 
value can be specified exactly, there will be an increase in 
precision because prior information is incorporated into the 
model. 

The beta-binomial model has been studied extensively. 
For example, Nandram and Sedransk (1993), Nandram 
(1998) and Nandram and Choi (2002) show how to do 
Bayesian predictive inference of finite population propor- 
tions of the small areas for binomial and multinomial data. 
These models assume that the binomial probabilities share a 
common effect, thereby permitting adaptive pooling of the 
data from small areas (or clusters). However, it is possible to 
improve on these models further by including additional 
information using covariates via generalized linear models 
(e.g., see Ghosh, Natarajan, Stroud and Carlin 1998). It is 
worth noting that none of these works proposes ways to 
incorporate prior information about a linear combination of 
model parameters. Substantial gains in precision are expected 
when such prior information is incorporated in small 
area models; see Silvapulle and Sen (2006) for a book-length 
discussion of constrained statistical inference. It is 
also worth noting that Lazar, Meeden and Nelson (2008) 
showed how to include constraints in a nonparametric 
Bayesian approach via a Polya urn scheme to obtain the predictive 
distribution of finite population parameters. 

Our procedure is related to external benchmarking which 
occurs when a pre-specified estimator is obtained from 
external sources, such as a different survey, a census, or 
other administrative records. In benchmarking one wants the 
parts to add up to the whole. For example, when surveys are 
conducted over time, there are typically monthly surveys 
and annual surveys which are of much better quality than 
the monthly surveys. When the monthly surveys are esti- 
mated such that these estimates add up to the annual survey 
totals, there is a protection against model failure and there- 
fore improved estimates (i.e., reduced bias and possibly an 
increase in precision). These problems are prevalent in the 
government agencies, especially in employment and sales; 
see Hillmer and Trabelsi (1987) for an example on retail 
sales of hardware stores from the U.S. Census Bureau. 

Prior information from external benchmarking will lead 
to improved precision but can produce severely biased 
estimators as well. This will depend on how different the 
current survey is from the prior ones. Nandram, Toto and 
Choi (2011) applied external benchmarking to estimate the 
finite population means of small areas. The constraint is that the 
finite population mean for the entire population is a 
prespecified value, which again can be obtained from a prior 
survey, census or administrative records. In our current 
work we are not incorporating information about a linear 
combination of the finite population values, but rather we 
are inputting information about a linear combination of the 
superpopulation parameters (in this case binomial proba- 
bilities). 

We consider the problem in which binomial counts are 
obtained from similar small areas, and inference is required 
about the binomial probabilities. In the conclusion, we 
discuss how to extend our method to obtain the predictive 
distribution of finite population proportions. The standard 
beta-binomial model may be inadequate, and additional 
prior information must be incorporated. Our thesis is that 
there is an increase in precision over the standard beta- 
binomial small area model when prior information about the 
weighted average of the probabilities (e.g., average of the 
probabilities) is incorporated. That is, we incorporate prior 
information about a linear combination of binomial proba- 
bilities (a weighted average). The weights can be propor- 
tional to population sizes, and under proportional allocation 
they can be proportional to the sample sizes themselves. The 
purpose of incorporating prior information about the bino- 
mial probabilities is to increase precision, and at the same 
time one needs to control the bias. 

It is much easier for a survey practitioner to specify the 
value of the overall probability rather than the individual 
area probabilities. That is, the overall probability can be 
specified with relatively much less error than the individual 
probabilities. Of course, one can specify the overall proba- 
bility using prior information (a prior survey, census or 
administrative records), and so the specification of the 
overall probability will depend on the quality of the prior 
information. Thus, the problem falls naturally within the 
Bayesian paradigm because we are incorporating prior 
information about a parameter via a distribution. Thus, there 
will be gains in precision because of the extra information. 
However, a practitioner can still proceed when there is no 
prior information. One can use the ratio of the total number of 
successes to the total sample size over areas to form a reasonable 
specification of the overall probability, which is typically not 
of interest. This estimate will have much higher precision 
than the one for individual areas. There will still be a gain in 
precision, but clearly such gain is due to using the current 
data (double use) and the constraint. 

One example of a survey in which reliable information 
can be obtained to perform the benchmarking is the Nation- 
al Health Interview Survey (NHIS) which is conducted 
annually by the National Center for Health Statistics to 
assess aspects of the health of the U.S. population. This is a 
population-based survey and there are many health indi- 
cators of interest; one of these indicators is the number of 
doctor visits made in the past two weeks, and an informative 
quantity is the proportion of people who made at least one 
doctor visit last year (e.g., Nandram and Choi 2002). These 
proportions are useful for small domains formed by crossing 
age, race and sex for a particular state last year. Because the 
estimates over a state change very slowly over the previous 
years, the overall estimate from the year immediately pre- 
ceding last year can be used as a reliable benchmark for last 
year. If a reliable estimate cannot be obtained for the 
benchmark, one can construct an informative prior distribu- 
tion for it. For example, one can use the method of moments 
to equate the sample mean and sample variance of the 
overall estimates for the past few years to the mean and 
variance of a beta distribution to get a beta prior distribution. 
In either case, our procedure can be applied. 
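
The method-of-moments step described above can be sketched in a few lines. This is only an illustration, assuming the beta prior is parameterized as Beta{mu0*tau0, (1 - mu0)*tau0} as in Section 2.1; the function name `beta_prior_from_moments` and the past-year estimates in the usage note are hypothetical, not values from any survey.

```python
def beta_prior_from_moments(estimates):
    """Match the sample mean and variance of past overall estimates to a
    Beta(mu0 * tau0, (1 - mu0) * tau0) distribution and return (mu0, tau0)."""
    k = len(estimates)
    if k < 2:
        raise ValueError("need at least two past estimates")
    mean = sum(estimates) / k
    var = sum((x - mean) ** 2 for x in estimates) / (k - 1)
    # Beta(mu*tau, (1-mu)*tau) has mean mu and variance mu*(1-mu)/(tau+1),
    # so tau = mu*(1-mu)/var - 1.
    tau0 = mean * (1.0 - mean) / var - 1.0
    if tau0 <= 0.0:
        raise ValueError("sample variance too large for a beta distribution")
    return mean, tau0
```

For example, with hypothetical overall estimates `[0.130, 0.140, 0.135, 0.138]` from four earlier years, `beta_prior_from_moments` returns a prior mean near 0.136 and a large prior sample size, reflecting how little the overall estimate moves from year to year.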

The plan of this paper is as follows. In Section 2 we 
describe the methodology. Specifically, we describe the 
standard beta-binomial model, and we develop two addi- 
tional models to incorporate the extra information using 
appropriate prior distributions. We also describe posterior 
inference and how to perform the nonstandard computa- 
tions. In Section 3 we describe an illustrative example on 
obesity, and a simulation study to assess empirically the 
statistical properties of our models. Section 4 has con- 
cluding remarks. We also discuss how to do Bayesian 
predictive inference for finite population proportions. While 
we discuss binary data, we also show how one can extend 
our method to polychotomous data. 


2. Methodology 


We show how to incorporate the constraint into the beta- 
binomial model in two ways, thereby providing a set of 
alternative models. In Section 2.1 we describe the models 
and in Section 2.2 we describe posterior inference. We 
attempt to explain what the constraint does to the estimates 
of the probabilities using an approximation. In Section 2.3 
we describe the computation, and we describe a new 
algorithm as well. 


2.1 Models 


We assume that binary data are available from $\ell$ small 
areas, and we assume that the probability that an individual 
responds in the $i$th area is $\pi_i$, $i = 1, \ldots, \ell$. Let $n_i$ be the 
number of individuals sampled from the $i$th area, $i = 1, \ldots, \ell$. 
Also let $s_i$ denote the number of individuals with 
the characteristic and $f_i = n_i - s_i$ be the number of 
individuals without the characteristic in the $i$th area, 
$i = 1, 2, \ldots, \ell$. Then the standard beta-binomial hierarchical 
Bayesian model is 

$$s_i \mid \pi_i \overset{ind}{\sim} \text{Binomial}(n_i, \pi_i), \tag{1}$$

$$\pi_i \mid \mu, \tau \overset{iid}{\sim} \text{Beta}\{\mu\tau, (1-\mu)\tau\}, \quad i = 1, \ldots, \ell, \tag{2}$$

and 

$$p(\mu, \tau) = \frac{1}{(1+\tau)^2}, \quad 0 < \mu < 1, \ \tau > 0. \tag{3}$$


We use a shrinkage prior for $\tau$ because it is proper and 
noninformative, and there are no conjugate priors. Priors of 
the form $p(\tau) \propto 1/\tau$ are discouraged; see, for example, 
Gelman (2006). Other alternatives are half Cauchy densities 
and gamma densities (one would need to specify the hyperparameters). 
Henceforth, we will call the model specified by 
(1), (2) and (3) the unrestricted (UR) model or Model 1. 

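We next describe the restricted model, which is an extension 
of the unrestricted model. We obtain a simple linear 
combination of the binomial probabilities. Letting 
$\hat{\pi}_i = s_i / n_i$ and 

$$\omega_i = n_i \Big/ \sum_{j=1}^{\ell} n_j, \quad i = 1, \ldots, \ell,$$

we have 

$$\sum_{i=1}^{\ell} \omega_i \hat{\pi}_i = \sum_{i=1}^{\ell} s_i \Big/ \sum_{i=1}^{\ell} n_i.$$

Thus, taking the $\pi_i$ unknown, the linear combination is 
$\sum_{i=1}^{\ell} \omega_i \pi_i$. 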

Therefore, we need to make an adjustment in (2) to 
incorporate the restriction, $\sum_{i=1}^{\ell} \omega_i \pi_i = \theta$, conditional on $\theta$. 
We do so by introducing the variable $\phi = \sum_{i=1}^{\ell} \omega_i \pi_i - \theta$, 
so that the restriction is equivalent to $\phi = 0$. Now one of 
the variables, $\pi_i$, $i = 1, \ldots, \ell$, is redundant. It is worth 
noting that one can choose any one of $\pi_1, \ldots, \pi_\ell$, and 
without loss of generality and for ease of exposition, we 
choose $\pi_\ell$. Thus, to incorporate the restriction, we transform 
$\pi_\ell$ to $\phi = \sum_{i=1}^{\ell} \omega_i \pi_i - \theta$, keeping $\pi_1, \ldots, \pi_{\ell-1}$ 
untransformed, and we let $\pi_{(\ell)} = (\pi_1, \ldots, \pi_{\ell-1})$. 

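As the Jacobian is $1/\omega_\ell$, 

$$p(\pi_{(\ell)}, \phi \mid \mu, \tau, \theta) = \frac{1}{\omega_\ell} \left[ \prod_{i=1}^{\ell-1} \frac{\pi_i^{\mu\tau-1} (1-\pi_i)^{(1-\mu)\tau-1}}{B\{\mu\tau, (1-\mu)\tau\}} \right] \frac{\pi_\ell^{\mu\tau-1} (1-\pi_\ell)^{(1-\mu)\tau-1}}{B\{\mu\tau, (1-\mu)\tau\}}, \tag{4}$$

where $0 < \pi_i < 1$, $i = 1, \ldots, \ell-1$, $0 < \mu < 1$, $\tau > 0$, 
$\phi + \theta - \omega_\ell < \sum_{i=1}^{\ell-1} \omega_i \pi_i < \phi + \theta$, 
and 

$$\pi_\ell = \frac{\phi + \theta - \sum_{i=1}^{\ell-1} \omega_i \pi_i}{\omega_\ell}. \tag{5}$$

Note that the joint prior density of $(\pi_{(\ell)}, \phi)$ in (4) is well 
defined. We wish to take $\phi = 0$ in (5) to incorporate the 
restriction, but when $\phi = 0$ the joint density of $\pi_{(\ell)}$ is not 
well defined. 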

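We assume $\mu$, $\tau$ and $\theta$ are independent a priori with 
$p(\mu, \tau, \theta) = p_1(\mu, \tau)\, p_2(\theta)$, where 

$$p_1(\mu, \tau) = \frac{1}{(1+\tau)^2}, \quad 0 < \mu < 1, \ \tau > 0,$$

as in (3), and $p_2(\theta)$ is given by 

$$\theta \sim \text{Beta}\{\mu_0 \tau_0, (1 - \mu_0) \tau_0\}. \tag{6}$$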


For the restricted model we consider two scenarios. 
Letting $\tau_0 \to \infty$, $\theta$ becomes a point mass at $\mu_0$, and in 
this case $\theta = \mu_0$ is to be specified by a practitioner; we will 
call the adjusted model the fixed (FI) model or Model 2. We 
have a second scenario in which a practitioner specifies $\mu_0$ 
and $\tau_0$ but not $\theta$; we will call this adjusted model the 
informative (IN) model or Model 3. Thus, there are three 
models, including the unrestricted model. To provide a 
unified framework, we need all our priors to be proper. The 
exact value of $\theta$ is likely to be unknown in most applications, 
and this can lead to estimates which are not internally 
coherent. 

It is worth noting that we have considered an additional 
model to help study the gain in precision of IN relative to FI. 
For comparison we want to impose a proper but noninformative 
prior on $\theta$, so that $\theta \sim \text{Uniform}(0, 1)$ is not an 
unreasonable choice. Letting $\mu_0 = 1/2$, $\tau_0 = 2$ in (6), we get 
$\theta \sim \text{Uniform}(0, 1)$, and we will call the 
adjusted model the uniform (UN) model or Model 4; of 
course, we do not need to specify $\mu_0$ and $\tau_0$. It is worth 
noting that the prior corresponding to $\tau_0 \to 0$ is improper 
as it corresponds to $\theta \sim \text{Beta}(0, 0)$. We do not consider 
this model further; however, although UN does not have a 
constraint, we will consider it briefly throughout. 
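
To make the role of $\mu_0$ and $\tau_0$ concrete, the short sketch below evaluates the prior (6) for two settings: the uniform case $\mu_0 = 1/2$, $\tau_0 = 2$, and the values $\mu_0 = 0.136$, $\tau_0 = 959$ used later in Section 3.1. The helper name `theta_prior_summary` is an illustrative assumption; only standard beta moment formulas are used.

```python
import math

def theta_prior_summary(mu0, tau0):
    """Return (a, b, mean, sd) of the prior theta ~ Beta(mu0*tau0, (1-mu0)*tau0)."""
    a = mu0 * tau0
    b = (1.0 - mu0) * tau0
    mean = a / (a + b)
    # Variance of Beta(a, b) is a*b / ((a+b)^2 * (a+b+1)).
    sd = math.sqrt(a * b / ((a + b) ** 2 * (a + b + 1.0)))
    return a, b, mean, sd

# mu0 = 1/2, tau0 = 2 gives Beta(1, 1), i.e., the Uniform(0, 1) prior of Model 4.
print(theta_prior_summary(0.5, 2.0))
# A large tau0 concentrates the prior near mu0, approaching the fixed (FI) case.
print(theta_prior_summary(0.136, 959.0))
```

The prior standard deviation shrinks roughly like $1/\sqrt{\tau_0 + 1}$, which is why a large $\tau_0$ behaves almost like a fully specified $\theta$.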


2.2 Posterior inference 


We consider making posterior inference about $\pi_i$, 
$i = 1, \ldots, \ell$. Let $s = (s_1, \ldots, s_\ell)$ and 
$\pi = (\pi_1, \ldots, \pi_\ell)$ [i.e., $\pi_{(\ell)} = (\pi_1, \ldots, \pi_{\ell-1})$ as defined above]. 

We use Bayes' theorem to find the joint posterior densities 
of all parameters. First, under the unrestricted model 
specified by (1), (2) and (3) the joint posterior density of 
$\pi, \mu, \tau$ is 

$$p(\pi, \mu, \tau \mid s) \propto \left[ \prod_{i=1}^{\ell} \frac{\pi_i^{s_i+\mu\tau-1} (1-\pi_i)^{f_i+(1-\mu)\tau-1}}{B\{\mu\tau, (1-\mu)\tau\}} \right] \frac{1}{(1+\tau)^2}, \tag{7}$$

$0 < \pi_i < 1$, $i = 1, \ldots, \ell$, $0 < \mu < 1$, $\tau > 0$. 

Lemma 1 Under the unrestricted model the joint posterior 
density, $p(\pi, \mu, \tau \mid s)$, is proper. 
A proof of Lemma 1 is given in Appendix A. 

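Under the restricted model the joint posterior density of 
$\pi_{(\ell)}, \mu, \tau, \theta, \phi$ is 

$$p(\pi_{(\ell)}, \mu, \tau, \theta, \phi \mid s) \propto \left[ \prod_{i=1}^{\ell-1} \frac{\pi_i^{s_i+\mu\tau-1} (1-\pi_i)^{f_i+(1-\mu)\tau-1}}{B\{s_i+\mu\tau, f_i+(1-\mu)\tau\}} \right]$$

$$\times \frac{\left( \frac{\phi+\theta-\sum_{i=1}^{\ell-1} \omega_i \pi_i}{\omega_\ell} \right)^{s_\ell+\mu\tau-1} \left( 1 - \frac{\phi+\theta-\sum_{i=1}^{\ell-1} \omega_i \pi_i}{\omega_\ell} \right)^{f_\ell+(1-\mu)\tau-1}}{B\{s_\ell+\mu\tau, f_\ell+(1-\mu)\tau\}}$$

$$\times \left[ \prod_{i=1}^{\ell} \frac{B\{s_i+\mu\tau, f_i+(1-\mu)\tau\}}{B\{\mu\tau, (1-\mu)\tau\}} \right] \frac{1}{(1+\tau)^2} \times \theta^{\mu_0\tau_0-1} (1-\theta)^{(1-\mu_0)\tau_0-1}, \tag{8}$$

$0 < \pi_i < 1$, $i = 1, \ldots, \ell-1$, $0 < \mu < 1$, $\tau > 0$, 
$\phi + \theta - \omega_\ell < \sum_{i=1}^{\ell-1} \omega_i \pi_i < \phi + \theta$, $0 < \theta < 1$. 
Note that $\pi_\ell = (\phi + \theta - \sum_{i=1}^{\ell-1} \omega_i \pi_i) / \omega_\ell$. 

We get the pertinent joint posterior density by incorporating 
the constraint ($\phi = 0$) into (8). That is, 
$p(\pi_{(\ell)}, \mu, \tau, \theta \mid s, \phi = 0) \propto p(\pi_{(\ell)}, \mu, \tau, \theta, \phi = 0 \mid s)$, where 

$$p(\pi_{(\ell)}, \mu, \tau, \theta \mid s, \phi = 0) \propto \left[ \prod_{i=1}^{\ell-1} \frac{\pi_i^{s_i+\mu\tau-1} (1-\pi_i)^{f_i+(1-\mu)\tau-1}}{B\{s_i+\mu\tau, f_i+(1-\mu)\tau\}} \right]$$

$$\times \frac{\left( \frac{\theta-\sum_{i=1}^{\ell-1} \omega_i \pi_i}{\omega_\ell} \right)^{s_\ell+\mu\tau-1} \left( 1 - \frac{\theta-\sum_{i=1}^{\ell-1} \omega_i \pi_i}{\omega_\ell} \right)^{f_\ell+(1-\mu)\tau-1}}{B\{s_\ell+\mu\tau, f_\ell+(1-\mu)\tau\}}$$

$$\times \left[ \prod_{i=1}^{\ell} \frac{B\{s_i+\mu\tau, f_i+(1-\mu)\tau\}}{B\{\mu\tau, (1-\mu)\tau\}} \right] \frac{1}{(1+\tau)^2} \times \theta^{\mu_0\tau_0-1} (1-\theta)^{(1-\mu_0)\tau_0-1}, \tag{9}$$

$0 < \pi_i < 1$, $i = 1, \ldots, \ell-1$, $0 < \mu < 1$, $\tau > 0$, 
$\theta - \omega_\ell < \sum_{i=1}^{\ell-1} \omega_i \pi_i < \theta$, $0 < \theta < 1$. 
Note again that $\pi_\ell = (\theta - \sum_{i=1}^{\ell-1} \omega_i \pi_i) / \omega_\ell$. 
It is worth noting that the joint posterior density (9) incorporates 
the constraint $\sum_{i=1}^{\ell} \omega_i \pi_i = \theta$ exactly because 
$\pi_\ell = (\theta - \sum_{i=1}^{\ell-1} \omega_i \pi_i) / \omega_\ell$, 
$\theta - \omega_\ell < \sum_{i=1}^{\ell-1} \omega_i \pi_i < \theta$. That is, the 
joint posterior density is not a function of $\pi_\ell$, and posterior 
inference about $\pi_\ell$ follows from the identity 
$\pi_\ell = (\theta - \sum_{i=1}^{\ell-1} \omega_i \pi_i) / \omega_\ell$. Thus, there is absolutely no difference 
between $\theta$ and $\sum_{i=1}^{\ell} \omega_i \pi_i$. 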


Theorem 1 Under the restricted model the joint posterior 
density, $p(\pi_{(\ell)}, \mu, \tau, \theta \mid s, \phi = 0)$, is proper. 


A proof of Theorem 1 is given in Appendix A. 

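We note the difference between the densities for the 
unrestricted model in (7) and the restricted model in (9). 
Essentially, the term 

$$\left( \frac{\theta-\sum_{i=1}^{\ell-1} \omega_i \pi_i}{\omega_\ell} \right)^{s_\ell+\mu\tau-1} \left( 1 - \frac{\theta-\sum_{i=1}^{\ell-1} \omega_i \pi_i}{\omega_\ell} \right)^{f_\ell+(1-\mu)\tau-1} \times \theta^{\mu_0\tau_0-1} (1-\theta)^{(1-\mu_0)\tau_0-1}$$

in (9) replaces $\pi_\ell^{s_\ell+\mu\tau-1} (1-\pi_\ell)^{f_\ell+(1-\mu)\tau-1}$ in (7). Note that in (9), 
$\pi_\ell = (\theta - \sum_{i=1}^{\ell-1} \omega_i \pi_i) / \omega_\ell$. 
Let $a_i = s_i + \mu\tau$ and $b_i = f_i + (1-\mu)\tau$, $i = 1, \ldots, \ell$. Also 
let 

$$c_i = \left( \theta - \sum_{j=1, j \neq i}^{\ell-1} \omega_j \pi_j - \omega_\ell \right) \Big/ \omega_i$$

and 

$$d_i = \left( \theta - \sum_{j=1, j \neq i}^{\ell-1} \omega_j \pi_j \right) \Big/ \omega_i.$$

Then, 

$$p(\pi_i \mid \pi_{(i)}, \mu, \tau, \theta, s, \phi = 0) \propto \pi_i^{a_i-1} (1-\pi_i)^{b_i-1} (\pi_i - c_i)^{b_\ell-1} (d_i - \pi_i)^{a_\ell-1}, \tag{10}$$

$c_i < \pi_i < d_i$, $i = 1, \ldots, \ell-1$, where $\pi_{(i)}$ denotes the elements of $\pi$ other 
than $\pi_i$ and $\pi_\ell$. Note that this density function 
consists of two terms, $\pi_i^{a_i-1} (1-\pi_i)^{b_i-1}$ and 
$(\pi_i - c_i)^{b_\ell-1} (d_i - \pi_i)^{a_\ell-1}$; note the interchange between $a_\ell$ and $b_\ell$ in 
the second term. The first term is the conditional posterior 
density under the unrestricted model, and the second term is 
a generalized beta density [i.e., a beta$(b_\ell, a_\ell)$ distribution 
in the interval $(c_i, d_i)$]. Thus, the unrestricted beta density 
is adjusted by the generalized beta density. In the rest of the 
paper we denote by GenBeta$(a, b, c, d)$ the generalized 
beta random variable with density function, 

$$p(x) = (x - c)^{a-1} (d - x)^{b-1} \big/ \{(d - c)^{a+b-1} B(a, b)\},$$

$c < x < d$, $a, b > 0$. 

That is, $(X - c)/(d - c) \sim \text{Beta}(a, b)$ if and only if 
$X \sim \text{GenBeta}(a, b, c, d)$. 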
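
The location-scale relationship between Beta$(a, b)$ and GenBeta$(a, b, c, d)$ can be sketched as follows. This is a minimal illustration, not the authors' code; the function names `genbeta_pdf` and `genbeta_rvs` are assumptions introduced here, and the parameter values in the usage note are arbitrary.

```python
import math
import numpy as np

def genbeta_pdf(x, a, b, c, d):
    """Density of GenBeta(a, b, c, d) at a scalar x; zero outside (c, d)."""
    if not (c < x < d):
        return 0.0
    log_beta_fn = math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)
    log_pdf = ((a - 1.0) * math.log(x - c) + (b - 1.0) * math.log(d - x)
               - (a + b - 1.0) * math.log(d - c) - log_beta_fn)
    return math.exp(log_pdf)

def genbeta_rvs(a, b, c, d, size, rng):
    """Draw from GenBeta(a, b, c, d) by shifting and scaling Beta(a, b) draws."""
    return c + (d - c) * rng.beta(a, b, size=size)
```

For instance, `genbeta_rvs(5.0, 40.0, 0.02, 0.9, 1000, np.random.default_rng(0))` returns values in (0.02, 0.9) whose mean is close to `0.02 + 0.88 * 5 / 45` and whose variance is `0.88**2` times the Beta(5, 40) variance, since the transformation is linear.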

It is worth noting that we have ordered the areas in order 
of their counts (smallest to largest). This is convenient and 
advantageous both theoretically and computationally. 

In order to explain the gain in precision, we attempt to 
study (10) further by making two approximations. First, 
because the restriction under study is rather mild we do not 
expect $c_i$ to be much different from 0 and $d_i$ to be much 
different from 1. Under this assumption, we can approximate 
(10) by 

$$p_a(\pi_i \mid \pi_{(i)}, \mu, \tau, \theta, s, \phi = 0) \propto (\pi_i - c_i)^{a_i-1} (d_i - \pi_i)^{b_i-1} (\pi_i - c_i)^{b_\ell-1} (d_i - \pi_i)^{a_\ell-1},$$

$c_i < \pi_i < d_i$. 

Then, incorporating the normalization constant into 
$p_a(\pi_i \mid \pi_{(i)}, \mu, \tau, \theta, s, \phi = 0)$, we have 

$$p_a(\pi_i \mid \pi_{(i)}, \mu, \tau, \theta, s, \phi = 0) = \frac{(\pi_i - c_i)^{a_i-1} (d_i - \pi_i)^{b_i-1} (\pi_i - c_i)^{b_\ell-1} (d_i - \pi_i)^{a_\ell-1}}{\int_{c_i}^{d_i} (\pi_i - c_i)^{a_i-1} (d_i - \pi_i)^{b_i-1} (\pi_i - c_i)^{b_\ell-1} (d_i - \pi_i)^{a_\ell-1} \, d\pi_i}$$

$$= \frac{(\pi_i - c_i)^{a_i-1} (d_i - \pi_i)^{b_i-1}}{(d_i - c_i)^{a_i+b_i-1} B(a_i, b_i)} \times \frac{(\pi_i - c_i)^{b_\ell-1} (d_i - \pi_i)^{a_\ell-1}}{E\{(\pi_i - c_i)^{b_\ell-1} (d_i - \pi_i)^{a_\ell-1}\}}, \quad c_i < \pi_i < d_i, \tag{11}$$

where the expectation is taken over the generalized beta 
distribution $\pi_i \sim \text{GenBeta}(a_i, b_i, c_i, d_i)$, $i = 1, \ldots, \ell-1$. But 
under this latter density, $(\pi_i - c_i)^{b_\ell-1} (d_i - \pi_i)^{a_\ell-1}$ is an 
unbiased estimator of $E\{(\pi_i - c_i)^{b_\ell-1} (d_i - \pi_i)^{a_\ell-1}\}$. In addition, 
by construction $a_\ell$ and $b_\ell$ are relatively large and 
therefore $(\pi_i - c_i)^{b_\ell-1} (d_i - \pi_i)^{a_\ell-1}$ and its variance are expected 
to be small. Then, our second approximation is 

$$(\pi_i - c_i)^{b_\ell-1} (d_i - \pi_i)^{a_\ell-1} \approx E\{(\pi_i - c_i)^{b_\ell-1} (d_i - \pi_i)^{a_\ell-1}\}. \tag{12}$$

Therefore, combining (11) and (12), our final approximation 
of (10) is 

$$\pi_i \mid \pi_{(i)}, \mu, \tau, \theta, s, \phi = 0 \sim \text{GenBeta}(a_i, b_i, c_i, d_i). \tag{13}$$

It follows from (13) that 

$$E_r(\pi_i \mid \pi_{(i)}, \mu, \tau, \theta, s, \phi = 0) = c_i + (d_i - c_i) E_u(\pi_i \mid \mu, \tau, s)$$

and 

$$\text{Var}_r(\pi_i \mid \pi_{(i)}, \mu, \tau, \theta, s, \phi = 0) = (d_i - c_i)^2 \, \text{Var}_u(\pi_i \mid \mu, \tau, s), \tag{14}$$

where $u$ refers to the unrestricted model and $r$ to the restricted 
model. Note that when $c_i = 0$ and $d_i = 1$, we get 
$E_r(\pi_i \mid \cdot) = E_u(\pi_i \mid \cdot)$ and $\text{Var}_r(\pi_i \mid \cdot) = \text{Var}_u(\pi_i \mid \cdot)$. 
Generally though the estimates of $\pi_i$ will be a bit different from 
one scenario to the other. It is also interesting that 
$\text{Var}_r(\pi_i \mid \cdot) \leq \text{Var}_u(\pi_i \mid \cdot)$ at least approximately. Thus, 
the restriction $\sum_{i=1}^{\ell} \omega_i \pi_i = \theta$ will reduce variability when 
the $\pi_i$ are estimated. This is true because under the restricted model the $\pi_i$, 
$i = 1, \ldots, \ell$, belong to an $\ell - 1$ dimensional simplex in the $\ell$ 
dimensional hypercube while for the unrestricted model the 
$\pi_i$, $i = 1, \ldots, \ell$, belong to the $\ell$ dimensional hypercube. 
We expect the largest gain in precision when $\theta$ is 
completely specified, followed by the case when $\mu_0$ is 
specified and $\tau_0 \gg 2$, and the least gain in precision when 
$\theta \sim \text{Uniform}(0, 1)$. 


2.3 Computation 


We show how to draw samples from the unrestricted and 
restricted models. For the unrestricted model we are able to 
draw random samples from (7) without using Markov chain 
Monte Carlo methods. However, for the restricted model we 
use the griddy Gibbs sampler (Ritter and Tanner 1992) to 
draw samples from (9). 


2.3.1 Unrestricted model 


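We collapse over the $\pi_i$, draw samples from $p(\mu, \tau \mid s)$ 
using random draws from a bivariate grid, and finally obtain 
samples from the Rao-Blackwellized densities $p(\pi_i \mid \mu, \tau, s)$. 
Then, 

$$\pi_i \mid \mu, \tau, s \overset{ind}{\sim} \text{Beta}\{s_i + \mu\tau, f_i + (1-\mu)\tau\}, \quad i = 1, \ldots, \ell, \tag{15}$$

and integrating out $\pi$, we get 

$$p(\mu, \tau \mid s) \propto \left[ \prod_{i=1}^{\ell} \frac{B\{s_i+\mu\tau, f_i+(1-\mu)\tau\}}{B\{\mu\tau, (1-\mu)\tau\}} \right] \frac{1}{(1+\tau)^2},$$

$0 < \mu < 1$, $\tau > 0$. Letting $\delta = \tau/(1+\tau)$, we have 

$$p(\mu, \delta \mid s) \propto \prod_{i=1}^{\ell} \frac{B\{s_i+\mu\tau, f_i+(1-\mu)\tau\}}{B\{\mu\tau, (1-\mu)\tau\}}, \quad 0 < \mu, \delta < 1, \quad \tau = \frac{\delta}{1-\delta}.$$

First we draw $(\mu, \delta) \mid s$ using a bivariate grid on $(0, 1)^2$ 
to obtain a sample of $M \approx 10{,}000$ values of $(\mu^{(h)}, \delta^{(h)})$, 
$h = 1, \ldots, M$, with $\tau^{(h)} = \delta^{(h)} / (1 - \delta^{(h)})$. Then we perform a data 
augmentation in (15) to obtain $\pi^{(h)}$, $h = 1, \ldots, M$, using 
a composition method. That is, we simply draw 
$\pi_i^{(h)} \sim \text{Beta}\{s_i + \mu^{(h)}\tau^{(h)}, f_i + (1 - \mu^{(h)})\tau^{(h)}\}$, 
$i = 1, \ldots, \ell$, $h = 1, \ldots, M$. 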

To perform the bivariate grid method for sampling from 
the posterior density of $(\mu, \delta)$, we divide the interval (0, 1) 
into 100 sub-intervals; so there are 10,000 little squares in 
the original unit square. We obtain the heights of the poste- 
rior density (without the normalization constant) at the 
center of each of the 10,000 squares. Because these little 
squares have the same area, the heights of the bivariate 
density are proportional to the posterior probabilities that 
$(\mu, \delta)$ fall in each of these squares. Thus, we have constructed 
a joint posterior probability mass function of 
$(\mu, \delta)$ on very fine grids. It is easy to draw a sample from 
the discrete bivariate probability mass function by using the 
cumulative distribution method. This is actually a random 
draw of one of the 10,000 squares with probabilities propor- 
tional to the heights of the little squares. Then within the 
selected square we choose a point at random by drawing 
two uniform random variables (i.e., uniform random jit- 
tering). Indeed, this is a very accurate random draw from the 
joint posterior density of $(\mu, \delta)$. We draw $M = 10{,}000$ 
samples from this approximation for posterior inference in a 
standard Monte Carlo procedure with independent samples, 
not a Markov chain. Because of the random jittering the 
numbers are different with probability one. 
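
The bivariate grid draw with uniform jittering, followed by the composition step, might be sketched as below. This is an illustration under stated assumptions rather than the authors' program: the function name `sample_unrestricted` is hypothetical, NumPy and SciPy are assumed to be available, and the grid variable is taken to be $\delta = \tau/(1+\tau)$, so that the shrinkage prior on $\tau$ becomes uniform on $(0, 1)$ and the grid heights are just the product of beta-function ratios.

```python
import numpy as np
from scipy.special import betaln

def sample_unrestricted(s, n, M=10_000, grid=100, seed=0):
    """Draw (mu, tau, pi) from the unrestricted beta-binomial posterior
    using a grid on (mu, delta) with delta = tau / (1 + tau) (assumed)."""
    rng = np.random.default_rng(seed)
    s = np.asarray(s, dtype=float)
    f = np.asarray(n, dtype=float) - s
    # Centers of the grid x grid little squares in the unit square.
    centers = (np.arange(grid) + 0.5) / grid
    mu, delta = np.meshgrid(centers, centers, indexing="ij")
    tau = delta / (1.0 - delta)
    a = (mu * tau)[..., None]
    b = ((1.0 - mu) * tau)[..., None]
    # Unnormalized log posterior height at each square center.
    log_post = np.sum(betaln(s + a, f + b) - betaln(a, b), axis=-1)
    prob = np.exp(log_post - log_post.max()).ravel()
    prob /= prob.sum()
    # Pick squares with probability proportional to height, then jitter.
    cell = rng.choice(prob.size, size=M, p=prob)
    i, j = np.unravel_index(cell, (grid, grid))
    mu_h = (i + rng.random(M)) / grid
    delta_h = (j + rng.random(M)) / grid
    tau_h = delta_h / (1.0 - delta_h)
    # Composition step: pi_i | mu, tau, s ~ Beta(s_i + mu*tau, f_i + (1-mu)*tau).
    pi_h = rng.beta(s + (mu_h * tau_h)[:, None],
                    f + ((1.0 - mu_h) * tau_h)[:, None])
    return mu_h, tau_h, pi_h
```

Calling `sample_unrestricted(s, n)` with the counts `s` and sample sizes `n` of the small areas returns independent draws; column means of `pi_h` estimate the posterior means of the $\pi_i$, and no convergence monitoring is needed because the draws do not form a Markov chain.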


2.3.2 Restricted model 


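We show how to draw samples from the restricted model 
using the Gibbs sampler. The joint conditional posterior 
density of $\pi_1, \ldots, \pi_{\ell-1}$ is 

$$p(\pi_{(\ell)} \mid \theta, \mu, \tau, s, \phi = 0) \propto \left[ \prod_{i=1}^{\ell-1} \pi_i^{s_i+\mu\tau-1} (1-\pi_i)^{f_i+(1-\mu)\tau-1} \right] \left( \theta - \sum_{i=1}^{\ell-1} \omega_i \pi_i \right)^{s_\ell+\mu\tau-1} \left( \omega_\ell - \theta + \sum_{i=1}^{\ell-1} \omega_i \pi_i \right)^{f_\ell+(1-\mu)\tau-1}, \tag{16}$$

where 

$$0 < \pi_i < 1, \ i = 1, \ldots, \ell-1, \quad \theta - \omega_\ell < \sum_{i=1}^{\ell-1} \omega_i \pi_i < \theta.$$

Thus, we would obtain samples of $\pi_1, \ldots, \pi_{\ell-1}$ and we 
set 

$$\pi_\ell = \left( \theta - \sum_{i=1}^{\ell-1} \omega_i \pi_i \right) \Big/ \omega_\ell$$

to complete the vector $\pi_1, \ldots, \pi_\ell$. That is, the constraint is 
obtained exactly. The conditional posterior density of $\theta$ is 

$$p(\theta \mid \pi_{(\ell)}, \mu, \tau, s, \phi = 0) \propto \left( \theta - \sum_{i=1}^{\ell-1} \omega_i \pi_i \right)^{s_\ell+\mu\tau-1} \left( \omega_\ell + \sum_{i=1}^{\ell-1} \omega_i \pi_i - \theta \right)^{f_\ell+(1-\mu)\tau-1} \times \theta^{\mu_0\tau_0-1} (1-\theta)^{(1-\mu_0)\tau_0-1}, \tag{17}$$

where 

$$\sum_{i=1}^{\ell-1} \omega_i \pi_i < \theta < \omega_\ell + \sum_{i=1}^{\ell-1} \omega_i \pi_i.$$

The joint conditional posterior density of $\mu$ and $\tau$ is 

$$p(\mu, \tau \mid \pi_{(\ell)}, \theta, s, \phi = 0) \propto \frac{\prod_{i=1}^{\ell} \pi_i^{\mu\tau-1} (1-\pi_i)^{(1-\mu)\tau-1}}{[B\{\mu\tau, (1-\mu)\tau\}]^{\ell}} \, \frac{1}{(1+\tau)^2}, \tag{18}$$

$0 < \mu < 1$, $\tau > 0$, with $\pi_\ell = (\theta - \sum_{i=1}^{\ell-1} \omega_i \pi_i) / \omega_\ell$. 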

To perform the Gibbs sampler, we need to draw samples 
from (16), (17) and (18), each in turn, until convergence. 
We draw $\mu, \tau$ from $p(\mu, \tau \mid \pi_{(\ell)}, \theta, s)$ in a manner similar 
to the bivariate grid draw of $(\mu, \tau)$ in the unrestricted model. It 
is more difficult to draw samples from (16) and (17). 
However, we use essentially the same method to draw 
samples from the conditional posterior density of $\pi_i$, 
$i = 1, \ldots, \ell-1$, obtained from (16) and $\theta$ from (17), which are 
both proportional to the product of two density functions; 
one is a truncated beta density and the other a generalized 
beta density. We next develop some theory to draw a 
sample from such a density. For this purpose, we state and 
prove Lemma 2 and Theorem 2. 

The density function of interest is 

$$f(x) = A f_1(x) f_2(x), \quad c < x < d, \tag{19}$$

where 

$$f_1(x) = \frac{x^{g-1} (1-x)^{h-1}}{\int_c^d x^{g-1} (1-x)^{h-1} \, dx}, \quad c < x < d, \ g, h > 0, \tag{20}$$

$$f_2(x) = (x - c)^{a-1} (d - x)^{b-1} \big/ \{(d - c)^{a+b-1} B(a, b)\}, \quad c < x < d, \ a, b > 0, \tag{21}$$

and, of course, 

$$A^{-1} = \int_c^d f_1(x) f_2(x) \, dx. \tag{22}$$


It is worth noting that we are not assuming $g, h > 1$. If 
this were the case, then $f_1(x)$ and $f_2(x)$ would both be logconcave, 
thereby making $f(x)$ logconcave, and in this case 
one could draw a sample from $f(x)$ using the adaptive rejection 
sampler (ARS, Gilks and Wild 1992). We are providing 
a specialized algorithm to draw a sample from $f(x)$, which 
need not be logconcave. Even if $f(x)$ were logconcave (i.e., 
$g, h > 1$) this specialized algorithm would still be better than 
the ARS because the ARS is a general purpose algorithm; 
see Robert and Casella (1999, page 59). Our algorithm 
requires less computation and does not need logconcavity; 
even if there is logconcavity the ARS can perform poorly in 
the tails of the density function. 


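Lemma 2 Consider the density functions $f_1(x)$ and $f_2(x)$ 
with $a, b > 1$. 


(a) Then 

$$\sup_{c < x < d} f_2(x) = \frac{\delta^{a-1} (1 - \delta)^{b-1}}{(d - c) B(a, b)},$$

where $\delta = (a - 1)/(a + b - 2)$. 


(b) For any $g > 0$, $h > 0$ there exist two constants $H_1$ 
and $H_2$ such that 

$$0 < H_1 \leq A \leq H_2 < \infty.$$


A proof of Lemma 2 is given in Appendix A. 
Theorem 2 Let $F_{g,h}(\cdot)$ be the cdf of a Beta$(g, h)$ random 
variable and $F_{g,h}^{-1}(\cdot)$ be its inverse. Let 

$$U, V \overset{ind}{\sim} \text{Uniform}(0, 1),$$

and let 

$$X = F_{g,h}^{-1}\{U F_{g,h}(d) + (1 - U) F_{g,h}(c)\}.$$


If for two real numbers $a, b > 1$, 

$$V \leq \left\{ \frac{X - c}{(d - c)\delta} \right\}^{a-1} \left\{ \frac{d - X}{(d - c)(1 - \delta)} \right\}^{b-1}, \tag{23}$$

where $\delta = (a - 1)/(a + b - 2)$, then $X$ has the density 
$f(x) = A f_1(x) f_2(x)$. 


A proof of Theorem 2 is given in Appendix A. 

Theorem 2 gives us the following algorithm for drawing 
samples from $f(\pi) \propto \pi^{g-1} (1 - \pi)^{h-1} (\pi - c)^{a-1} (d - \pi)^{b-1}$, 
$c < \pi < d$, $a, b > 1$, $g, h > 0$. 


Algorithm 
(a) Draw $U \sim \text{Uniform}(0, 1)$ and set 

$$\pi = F_{g,h}^{-1}\{U F_{g,h}(d) + (1 - U) F_{g,h}(c)\}.$$


(b) Draw $V \sim \text{Uniform}(0, 1)$. If 

$$V \leq \left\{ \frac{\pi - c}{(d - c)\delta} \right\}^{a-1} \left\{ \frac{d - \pi}{(d - c)(1 - \delta)} \right\}^{b-1}, \tag{24}$$

accept $\pi$, otherwise go to (a). 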
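
Steps (a) and (b) of the algorithm might be coded as in the sketch below. It assumes $0 \leq c < d \leq 1$ and $a, b > 1$, and relies on SciPy's `beta.cdf` and `beta.ppf` for $F_{g,h}$ and its inverse; the function name `draw_product_density` and the `max_tries` safeguard are illustrative additions, not part of the paper.

```python
import numpy as np
from scipy.stats import beta as beta_dist

def draw_product_density(g, h, a, b, c, d, rng, max_tries=100_000):
    """One draw from f(x) proportional to
    x^(g-1) (1-x)^(h-1) (x-c)^(a-1) (d-x)^(b-1) on (c, d), by accept-reject."""
    if not (0.0 <= c < d <= 1.0):
        raise ValueError("need 0 <= c < d <= 1")
    if a <= 1.0 or b <= 1.0:
        raise ValueError("need a, b > 1 so the generalized beta term is bounded")
    delta = (a - 1.0) / (a + b - 2.0)
    F_c, F_d = beta_dist.cdf([c, d], g, h)
    for _ in range(max_tries):
        u, v = rng.random(2)
        # Step (a): inverse-cdf draw from Beta(g, h) truncated to (c, d).
        x = beta_dist.ppf(u * F_d + (1.0 - u) * F_c, g, h)
        if not (c < x < d):
            continue  # guard against rounding at the end points
        # Step (b): accept with probability f2(x) / sup f2.
        log_ratio = ((a - 1.0) * np.log((x - c) / ((d - c) * delta))
                     + (b - 1.0) * np.log((d - x) / ((d - c) * (1.0 - delta))))
        if v <= np.exp(log_ratio):
            return float(x)
    raise RuntimeError("no draw accepted within max_tries")
```

Working with the logarithm of the acceptance ratio avoids overflow or underflow when $a$ and $b$ are large, which is the typical situation when the largest area is placed last.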


Because the binomial sample sizes are arranged in 
increasing order, in any application it will be true that 
a,b>1 and g,h> 0 (possibly greater than 1 as well). 
Thus, the algorithm will work. Indeed, in all our examples 
(one presented here) and simulation exercises the algorithm 
runs very quickly. 

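Now, we show how to draw $\pi_i$, $i = 1, \ldots, \ell-1$, and $\theta$. 
For $\pi_i$, 

$$p(\pi_i \mid \pi_{(i)}, \theta, \mu, \tau, s, \phi = 0) \propto \pi_i^{a_i-1} (1 - \pi_i)^{b_i-1} (\pi_i - c_i)^{b_\ell-1} (d_i - \pi_i)^{a_\ell-1}, \quad c_i < \pi_i < d_i,$$

where $\pi_{(i)}$ is the vector containing the elements of $\pi$ 
except for $\pi_i$ and $\pi_\ell$, and $a_i = s_i + \mu\tau$, $b_i = f_i + (1 - \mu)\tau$, 
$i = 1, \ldots, \ell$, 

$$c_i = \left( \theta - \sum_{j=1, j \neq i}^{\ell-1} \omega_j \pi_j - \omega_\ell \right) \Big/ \omega_i,$$

$$d_i = \left( \theta - \sum_{j=1, j \neq i}^{\ell-1} \omega_j \pi_j \right) \Big/ \omega_i.$$

Apply the theorem to $p(\pi_i \mid \pi_{(i)}, \theta, \mu, \tau, s)$, with $a_\ell > 1$, $b_\ell > 1$ 
as well. 
For $\theta$, we have 

$$p(\theta \mid \pi_{(\ell)}, \mu, \tau, s, \phi = 0) \propto \theta^{\mu_0\tau_0-1} (1 - \theta)^{(1-\mu_0)\tau_0-1} (\theta - \tilde{c})^{a_\ell-1} (\tilde{d} - \theta)^{b_\ell-1}, \quad \tilde{c} < \theta < \tilde{d},$$

where 

$$\tilde{c} = \sum_{i=1}^{\ell-1} \omega_i \pi_i, \quad \tilde{d} = \omega_\ell + \sum_{i=1}^{\ell-1} \omega_i \pi_i.$$

Again, apply the theorem, with $a_\ell > 1$, $b_\ell > 1$. 

When $\theta$ is fully specified (i.e., $\theta$ is not random), we do 
not have to draw $\theta$. However, when $\theta \sim \text{Uniform}(0, 1)$ 
a priori ($\mu_0 = 1/2$, $\tau_0 = 2$), we have a simplification. In 
this case, 

$$\theta \mid \pi_{(\ell)}, \mu, \tau, s, \phi = 0 \sim \text{GenBeta}(a_\ell, b_\ell, \tilde{c}, \tilde{d})$$

and $\theta = \tilde{c} + (\tilde{d} - \tilde{c}) X$, where $X \sim \text{Beta}(a_\ell, b_\ell)$, has the 
required density. 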

For both the unrestricted and restricted models we use 
10,000 iterates to make posterior inference about the binomial 
probabilities, $\pi_i$. Under the unrestricted model these 
are simply random draws and no monitoring is required. For 
the restricted model, running the griddy Gibbs sampler, we 
drew 11,000 iterates, used 1,000 as a "burn in" (a conservative 
number because convergence occurs much earlier, as is 
evident in the trace plots) and we found negligible correlations 
among the iterates. Thus, we used 10,000 iterates to 
make inference about the binomial probabilities. For both 
the unrestricted and the three restricted models it takes only 
a few seconds on our 2 x 833 MHz alpha computer. 


3. Numerical studies 


In Section 3.1 we describe an illustrative example to 
show the main features of the restriction. In Section 3.2 we 
describe a simulation study to show frequentist properties of 
the Bayes estimators, and we show deeper insight into the 
differences among the four scenarios. Note again that when 
we performed the computations, it is convenient to order the 
domain sizes so that the largest domain comes last. 


3.1 Illustrative example 


We have used data in the third National Health and 
Nutrition Examination Survey (NHANES) to illustrate our 
method. We have studied body mass index for teenagers, 
and we have data on the sample obtained. The domains 
(small areas) are formed by crossing ethnicity (white, black, 
Mexican) and sex (male, female). We have separated out the 
teenagers with respect to whether they were in middle 
school or high school at the time of the survey. Thus, there 
are 12 small domains. The data are presented in the first four 
columns of Table 1 by domain. Note that domains MWM, 
MBF, MWF and HBF are relatively sparse with 4, 2, 5, 5 
obese teenagers respectively; for the twelve domains the 
sample consists of 959 teenagers, of whom 130 are obese (i.e., the 
overall proportion of obese individuals is 0.136 approximately). 
In column 4 of Table 1 we have also presented the 
direct estimates by domains, and these estimates range from 
0.069 to 0.228. The estimates for the smallest domains will 
be unreliable. Moreover, when the beta-binomial models are 
used, these estimates will regress to the overall sample mean 
of 0.136, creating a possible bias. Our method is expected to 
increase precision beyond the unrestricted model because 
the restricted model uses more information about the 
weighted sum. Clearly, predictors based on either the 
restricted model or the beta-binomial model are biased if the 
specified model is wrong. 

We have taken $\mu_0 = 0.136$, the overall sample proportion, 
and $\tau_0 = 959$, the total sample size. Less optimistic 
choices can be used, for example, $\tau_0 = 100$, say; but this 
choice makes very little difference. However, it is worth 
noting that using the observed data to specify the prior 
distribution can artificially decrease the posterior variance. 
Typically a survey practitioner will have an appropriate 
specification from a prior survey or a census. One cannot 
specify values for $\mu_0$ and $\tau_0$ which are completely out of 
line and will create huge biases. Here $\tau_0$ is a prior sample 
size and $\mu_0$ is a prior mean of $\theta$. This method permits a 
sensible value for $\theta$; we are essentially adding a degree of 
uncertainty about knowledge of the linear combination. 
Thus, these specifications are not unreasonable. 

We have applied our method as described for the four 
scenarios. In the other columns of Table 1 we study the 
estimates of the small area probabilities. We present the 
posterior mean (PM), posterior standard deviation (PSD), 

RMSE = $\sqrt{(\hat{\pi} - \text{PM})^2 + \text{PSD}^2}$, 

where $\hat{\pi}$ is the direct estimate, and the 95% highest 
posterior density (HPD) interval (Int). As is expected, the 
PSDs are roughly in the increasing order: Model 2, Model 3, 
Model 4 and Model 1; in some cases the differences are 
important. The PMs for Models 1, 2 and 3 are mostly 
similar, but for Model 4 the PMs are mostly smaller than the 
other three models. There is much improvement of Models 
2 and 3 over Model 1 at least in terms of precision. This 
gain becomes less important for Model 4, the model with 
the greatest uncertainty about $\theta$. 
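
As a quick check of the RMSE definition above, the sketch below recomputes two entries of Table 1 from the reported direct estimate, PM and PSD. The function name `rmse` is simply a label chosen here.

```python
import math

def rmse(direct, pm, psd):
    """Root mean square error of a posterior summary relative to the
    direct estimate: sqrt((direct - PM)^2 + PSD^2)."""
    return math.sqrt((direct - pm) ** 2 + psd ** 2)

# Domain 1, Model 1: direct = 0.085, PM = 0.114, PSD = 0.033.
print(round(rmse(0.085, 0.114, 0.033), 3))
# Domain 9, Model 2: direct = 0.228, PM = 0.196, PSD = 0.032.
print(round(rmse(0.228, 0.196, 0.032), 3))
```

Both values agree with the tabulated RMSEs of 0.044 and 0.045 after rounding to three decimals.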


Survey Methodology, December 2011 


Table 1
Comparison of the four models using posterior mean (PM), posterior standard deviation (PSD), root mean square error (RMSE), and
95% credible HPD intervals (Int) of $\pi_i$ by domain (D) for the NHANES data

                          Model 1                                   Model 2
 D    s    n   Direct    PM     PSD    RMSE   Int               PM     PSD    RMSE   Int
 1    4   47   0.085     0.114  0.033  0.044  (0.051, 0.179)    0.111  0.032  0.041  (0.049, 0.170)
 2    2   29   0.069     [?]    0.037  0.057  (0.042, 0.183)    0.111  0.036  0.055  (0.041, 0.178)
 3   10   44   0.227     0.175  0.044  0.068  (0.100, 0.264)    0.177  0.041  0.065  (0.108, 0.260)
 4    5   62   0.081     0.107  0.030  0.040  (0.047, 0.159)    0.107  0.027  0.038  (0.054, 0.160)
 5   10   74   0.135     0.134  0.030  0.030  (0.077, 0.194)    0.134  0.028  0.028  (0.080, 0.190)
 6   12   69   0.174     0.158  0.036  0.039  (0.089, 0.227)    0.155  0.031  0.036  (0.095, 0.214)
 7    8   79   0.101     0.116  0.028  0.031  (0.065, 0.173)    0.115  0.027  0.030  (0.065, 0.166)
 8    5   62   0.081     0.107  0.030  0.040  (0.052, 0.169)    0.105  0.029  0.038  (0.042, 0.153)
 9   28  123   0.228     0.196  0.036  0.048  (0.129, 0.262)    0.196  0.032  0.045  (0.131, 0.253)
10   10  111   0.090     0.106  0.026  0.030  (0.059, 0.155)    0.105  0.024  0.028  (0.061, 0.150)
11   16  122   0.131     0.132  0.026  0.026  (0.083, 0.183)    0.130  0.023  0.023  (0.090, 0.179)
12   20  137   0.146     0.144  0.026  0.026  (0.094, 0.194)    0.141  0.022  0.023  (0.100, 0.184)

                          Model 3                                   Model 4
 D    s    n   Direct    PM     PSD    RMSE   Int               PM     PSD    RMSE   Int
 1    4   47   0.085     0.111  0.033  0.042  (0.044, 0.169)    0.109  0.032  0.040  (0.050, 0.172)
 2    2   29   0.069     0.111  0.037  0.056  (0.039, 0.179)    0.108  0.036  0.053  (0.037, 0.173)
 3   10   44   0.227     0.175  0.043  0.068  (0.093, 0.260)    0.170  0.044  0.072  (0.091, 0.255)
 4    5   62   0.081     0.106  0.029  0.038  (0.050, 0.160)    0.103  0.030  0.038  (0.048, 0.164)
 5   10   74   0.135     0.134  0.029  0.029  (0.077, 0.189)    0.129  0.030  0.030  (0.067, 0.184)
 6   12   69   0.174     0.156  0.034  0.038  (0.090, 0.217)    0.151  0.036  0.043  (0.087, 0.222)
 7    8   79   0.101     0.118  0.028  0.033  (0.062, 0.171)    0.111  0.028  0.029  (0.061, 0.167)
 8    5   62   0.081     0.107  0.030  0.040  (0.051, 0.165)    0.102  0.030  0.036  (0.050, 0.159)
 9   28  123   0.228     0.195  0.034  0.047  (0.138, 0.265)    0.189  0.035  0.052  (0.123, 0.255)
10   10  111   0.090     0.107  0.024  0.029  (0.062, 0.156)    0.104  0.025  0.029  (0.051, 0.149)
11   16  122   0.131     0.132  0.024  0.024  (0.086, 0.179)    0.126  0.025  0.025  (0.083, 0.179)
12   20  137   0.146     0.143  0.024  0.024  (0.095, 0.191)    [?]    0.025  0.027  (0.091, 0.189)

Note: The four models are: Model 1 - no restriction; Model 2 - fixed $\theta$; Model 3 - informative prior for $\theta$; Model 4 - uniform prior for $\theta$.
Domains are formed by crossing school (middle school - M, high school - H), race (white - W, black - B, Mexican American - M) and sex
(male - M, female - F). Thus, the domains are: 1-MWM, 2-MBF, 3-MMM, 4-MWF, 5-MBM, 6-MMF, 7-HWM, 8-HBF, 9-HMM, 10-HWF,
11-HBM, 12-HMF (e.g., the first domain consists of middle school white boys). $n$ is the number of teenagers and $s$ the number of
obese teenagers in each domain; Direct is the direct estimate $\hat{\pi} = s/n$. Data are taken from the 35 largest counties in the US.
An estimate of the overall probability is $130/959 \approx 0.136$, and for the first domain $\hat{\pi} = 4/47 = 0.085$;
$\mathrm{RMSE} = \sqrt{(\hat{\pi} - \mathrm{PM})^2 + \mathrm{PSD}^2}$.


We also study very briefly the nuisance parameter $\theta$. We note that
the weighted average of the direct estimators of the small areas is
0.136 (more accurately 0.1355599). When $\theta$ is held fixed at
0.1355599, the weighted average of the posterior means is 0.136. When
$\theta$ has the informative prior, the weighted average of the
posterior means is 0.136, and for $\theta$ the PM is 0.136, the PSD is
0.008, and a 95% HPD interval is (0.122, 0.152). When $\theta$ has the
uniform prior, the weighted average of the posterior means is 0.132, and
for $\theta$ the PM is 0.131, the PSD is 0.011, and a 95% HPD interval
is (0.110, 0.151). This shows the deficiencies of the uniform prior,
which we use only for comparison. It is worth noting that
$\pi_1, \ldots, \pi_{\ell-1}$ and $\theta$ are computed first. Then
$\pi_\ell$ is obtained by subtraction. This is done at each iterate of
the Gibbs sampler. Then, the posterior summaries for
$\sum_{i=1}^{\ell} \omega_i \pi_i$ and $\theta$ are computed. So there
will be very minor discrepancies which are due to rounding. The
numerical standard errors are all smaller than 0.001.


Finally, we have selected the four smallest domains to 
compare the posterior densities of the probabilities. We have 
used the Parzen-Rosenblatt kernel density estimator to 
estimate the posterior densities; see Silverman (1986) for 
details. Figure 1 compares the estimated posterior densities 
for the four models. It is interesting that as the domain sizes 
increase, the four models get closer together. Also, for all 
cases the tails of the distributions in each panel are very 
similar; the differences in these distributions though lie in 
the modal intervals (i.e., interval containing the mode), and 
their heights. As expected, the posterior density correspon- 
ding to the unrestricted model is the shortest, simply 
because it has more variability. Model 4 has posterior 
density shifted to the left and is slightly bimodal for the 
smallest domain. Thus, inference about the modes of these 
distributions will be different. But inference involving the 
tails will not be so different; except for Model 4, 95% 
credible intervals will be similar. 
Figure 1 Plots of the estimated posterior densities of the $\pi_i$ for the four smallest domains, for the four models and NHANES data


3.2 Simulation study 

We use a simulation study to assess the statistical 
properties of our method. We want to see if the gain in 
precision persists and to see how the estimators of the 
probabilities are shifted. We also study the frequentist 
properties of the estimators of the probabilities. In the 
description of the simulation it is convenient to use the 
abbreviated names of the models which are UR (Model 1,
no restriction), FI (Model 2, fixed $\theta$), IN (Model 3,
informative prior for $\theta$) and UN (Model 4, uniform prior
for $\theta$).

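We set $\theta_0 = 0.15$, $\mu_0 = \theta_0$, and $\tau_0 = 100$. We
have selected three values of $\ell = 12, 24, 36$, 12 being the number
of areas in the NHANES data. We drew the sample sizes from a uniform
density in (25, 150), again to reflect the NHANES data. First, we
generated the $\pi_i$ as iid draws subject to the constraint
$\sum_{i=1}^{\ell} \omega_i \pi_i = \theta_0$. To do this latter task,
we drew sets of $\ell - 1$ values $\pi_i$ until

$$\theta_0 - \omega_\ell \le \sum_{i=1}^{\ell-1} \omega_i \pi_i \le \theta_0,$$

and set $\pi_\ell = (\theta_0 - \sum_{i=1}^{\ell-1} \omega_i \pi_i)/\omega_\ell$.
Then, we generated

$$s_i \overset{ind}{\sim} \text{Binomial}(n_i, \pi_i).$$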
We have generated 1,000 data sets in this manner for each
of $\ell = 12, 24, 36$. Then, we fit the four models (one
unrestricted and three restricted models). The generation process is
very fast (i.e., for $\ell = 12, 24, 36$ there were
respectively 22, 90, 153 rejects in the 1,000 samples). We fit
each data set using random samples for the unrestricted
model and the griddy Gibbs sampler for the restricted
models. We fit the 1,000 data sets in a couple of hours on
our 2 x 833 MHz alpha computer.
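The rejection step used to generate one simulated data set can be sketched as follows. This is only an illustration under stated assumptions: the text does not give the distribution of the unconstrained draws or the weights at this point, so the sketch assumes Beta($\mu_0\tau_0$, $(1-\mu_0)\tau_0$) draws, weights $\omega_i = n_i/\sum_i n_i$, and integer sample sizes; the function name `simulate_dataset` is hypothetical.

```python
import numpy as np

def simulate_dataset(ell, theta0=0.15, mu0=0.15, tau0=100.0, rng=None,
                     max_tries=100_000):
    """Generate (n, omega, pi, s, rejects) with sum(omega * pi) = theta0."""
    rng = np.random.default_rng() if rng is None else rng
    n = rng.integers(25, 151, size=ell)      # sample sizes (assumed integer-valued)
    omega = n / n.sum()                      # ASSUMPTION: weights proportional to n_i
    a, b = mu0 * tau0, (1.0 - mu0) * tau0    # ASSUMPTION: beta draws for pi_1..pi_{ell-1}
    rejects = 0
    for _ in range(max_tries):
        pi_head = rng.beta(a, b, size=ell - 1)
        partial = float(omega[:-1] @ pi_head)
        # Keep the draw only if the last probability will lie in [0, 1].
        if theta0 - omega[-1] <= partial <= theta0:
            break
        rejects += 1
    else:
        raise RuntimeError("no admissible draw found")
    # pi_ell by subtraction; clip guards against floating-point overshoot.
    pi_last = float(np.clip((theta0 - partial) / omega[-1], 0.0, 1.0))
    pi = np.append(pi_head, pi_last)
    s = rng.binomial(n, pi)                  # s_i ~ Binomial(n_i, pi_i)
    return n, omega, pi, s, rejects

n, omega, pi, s, rejects = simulate_dataset(12, rng=np.random.default_rng(0))
print(float(omega @ pi), rejects)
```

The reject count depends on the assumed draw distribution and weights, so it need not match the counts quoted in the text.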

For these 1,000 simulations we study PM, the coverage
(C), the bias (B), PSD, RMSE and width (W) of the 95%
credible intervals. For each domain we compute the bias
$\mathrm{PM} - \pi_i$, then we average these values over all domains
and simulation runs, and this quantity we now call B.
Associated with B we also computed AB, the average of
$|\mathrm{PM} - \pi_i|$. Similarly, we have computed

$$\mathrm{RMSE} = \sqrt{(\mathrm{PM} - \pi_i)^2 + \mathrm{PSD}^2}$$

for each domain and each simulation run and we average
these over all domains and simulation runs. Note that the
true probabilities, $\pi_i$, are known by design. We obtain the
coverage (C) by computing the proportion of all intervals
containing the true value of $\pi_i$ over all domains and
simulation runs. We also obtain the average of the widths of
the 95% credible intervals. Numerical standard errors are
obtained for all quantities.
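The summaries described above are simple averages over domains and simulation runs. A minimal sketch follows, assuming the per-run posterior summaries are stored as arrays of shape (runs, domains); the function name `summarize` and the array layout are illustrative assumptions, not part of the paper.

```python
import numpy as np

def summarize(pm, psd, lower, upper, truth):
    """Average C, B, AB, PSD, RMSE and W over all domains and runs."""
    pm, psd, lower, upper, truth = map(np.asarray, (pm, psd, lower, upper, truth))
    err = pm - truth                                   # PM - pi_i
    return {
        "C": float(np.mean((lower <= truth) & (truth <= upper))),
        "B": float(np.mean(err)),
        "AB": float(np.mean(np.abs(err))),
        "PSD": float(np.mean(psd)),
        "RMSE": float(np.mean(np.sqrt(err ** 2 + psd ** 2))),
        "W": float(np.mean(upper - lower)),
    }

# Toy input: one run, two domains, true probabilities both 0.15.
out = summarize(pm=[[0.10, 0.20]], psd=[[0.03, 0.04]],
                lower=[[0.05, 0.16]], upper=[[0.20, 0.30]],
                truth=[[0.15, 0.15]])
print(out)
```

In the toy input the two biases cancel, so B is essentially zero while AB is 0.05, which is why the text reports both quantities.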


In Table 2 we study the estimates of the small area
probabilities. It is convenient to use the shorter names of the
four models for our discussion. For IN the PMs are close to
the nominal value of 0.15, but for UN the PMs are smaller
than the nominal value, particularly at $\ell = 12$. We
observe that the coverages for models UR, FI and UN
are always larger than the nominal value of 95%, but for
model IN these coverages are smaller than the nominal
value of 95%. A similar difference exists for the bias; while
the bias is small for all models, models UR, FI (the specified
value of $\theta$ is 0.15) and UN have negative biases but IN has
positive bias. Except for $\ell = 36$, IN has the largest AB. The
PSDs are mostly similar and the RMSEs share the same
features; there are some differences at $\ell = 12$. The four
models become similar as $\ell$ increases; when $\ell$ is large there
appears to be no need for our method. However, again the
gain in precision appears to be in the increasing order FI, IN,
UN and UR.

In most applications the exact value of $\theta$ is unknown.
Therefore, the PSDs of the $\pi_i$, under the situation where $\theta$
is assumed known, are likely to underestimate the true
PSDs. So we study the deviations of the PSDs of IN and
UN from those of FI, and we compute the ratios
$R_1 = \mathrm{PSD}_{IN}/\mathrm{PSD}_{FI}$ and
$R_2 = \mathrm{PSD}_{UN}/\mathrm{PSD}_{FI}$. In Table 3 we
present the five-number summaries of these ratios by
sample size. Most of the ratios are around 1 (i.e., within the
interquartile range) with some tendency for them to be larger
than 1. (Note that the maxima at $\ell = 12$ and $\ell = 24$ are
outliers possibly due to bad simulated samples.) Thus, overall
the PSDs under IN and UN are not much larger than under FI.


Table 2
Simulation: Comparison of the four models using coverage (C), bias and average absolute bias (B and AB), posterior standard
deviation (PSD), root mean squared error (RMSE) and width of the 95% credible intervals (W) of $\pi_i$

$\ell$  Model  C               B                AB                PSD             RMSE            W
12      UR     0.960 (0.0018)  -0.002 (0.0003)  0.0231 (0.00016)  0.033 (0.0001)  0.043 (0.0004)  0.125 (0.0003)
        FI     0.961 (0.0018)  -0.000 (0.0003)  0.0219 (0.00020)  0.031 (0.0001)  0.040 (0.0001)  0.118 (0.0003)
        IN     0.946 (0.0021)   0.005 (0.0003)  0.0275 (0.00066)  0.032 (0.0001)  0.043 (0.0001)  0.122 (0.0002)
        UN     0.956 (0.0019)  -0.000 (0.0003)  0.0261 (0.00019)  0.032 (0.0001)  0.042 (0.0001)  0.122 (0.0003)

24      UR     0.957 (0.0013)  -0.001 (0.0002)  0.0229 (0.00012)  0.031 (0.0000)  0.041 (0.0001)  0.119 (0.0002)
        FI     0.957 (0.0013)  -0.000 (0.0002)  0.0224 (0.00013)  0.030 (0.0000)  0.040 (0.0001)  0.116 (0.0002)
        IN     0.943 (0.0015)   0.006 (0.0002)  0.0252 (0.00058)  0.030 (0.0000)  0.041 (0.0001)  0.116 (0.0001)
        UN     0.952 (0.0014)  -0.000 (0.0002)  0.0236 (0.00012)  0.031 (0.0002)  0.041 (0.0002)  0.118 (0.0005)

36      UR     0.960 (0.0010)  -0.001 (0.0001)  0.0224 (0.00009)  0.030 (0.0000)  0.040 (0.0001)  0.117 (0.0001)
        FI     0.961 (0.0010)  -0.000 (0.0001)  0.0218 (0.00009)  0.030 (0.0000)  0.039 (0.0001)  0.115 (0.0001)
        IN     0.948 (0.0012)   0.005 (0.0002)  0.0224 (0.00009)  0.030 (0.0000)  0.040 (0.0001)  0.114 (0.0001)
        UN     0.957 (0.0011)  -0.000 (0.0001)  0.0228 (0.00010)  0.030 (0.0000)  0.040 (0.0001)  0.116 (0.0001)

Note: The four models are: Model 1 - no restriction (UR); Model 2 - fixed $\theta$ (FI); Model 3 - informative prior for $\theta$ (IN); Model 4 - uniform
prior for $\theta$ (UN). $\mathrm{RMSE} = \sqrt{(\pi_i - \mathrm{PM})^2 + \mathrm{PSD}^2}$. Each entry is an estimate followed by its standard error in parentheses.
Table 3
Simulation: A study of the posterior standard deviation (PSD)
of the $\pi_i$ using five-number summaries of the ratios, $R_1$ and
$R_2$, by sample size

$\ell$  Ratio   Min     Q1      Med     Q3      Max
12      R1      0.673   0.972   1.032   1.091   [?]
        R2      0.022   0.984   1.034   1.086   85.370
24      R1      0.019   0.965   1.005   1.047   16.017
        R2      0.024   0.979   1.014   1.049   486.960
36      R1      0.690   0.962   0.998   1.034   1.236
        R2      0.837   0.979   1.011   1.044   1.243

Note: $R_1 = \mathrm{PSD}_{IN}/\mathrm{PSD}_{FI}$ and $R_2 = \mathrm{PSD}_{UN}/\mathrm{PSD}_{FI}$. The five
summaries are minimum (Min), first quartile (Q1), median
(Med), third quartile (Q3) and maximum (Max).
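A five-number summary of PSD ratios like those in Table 3 can be computed as below. This is a sketch only: the paper does not say which quartile definition was used, so NumPy's default (linear interpolation) is an assumption, and the input arrays here are made-up values for illustration.

```python
import numpy as np

def five_number_summary(ratios):
    """Return (min, Q1, median, Q3, max) of a collection of ratios."""
    r = np.asarray(ratios, dtype=float)
    q1, med, q3 = np.percentile(r, [25, 50, 75])   # default linear interpolation
    return float(r.min()), float(q1), float(med), float(q3), float(r.max())

# Hypothetical PSDs for one small simulation (not values from the paper).
psd_fi = np.array([0.030, 0.031, 0.029, 0.032, 0.030])
psd_in = np.array([0.031, 0.030, 0.030, 0.034, 0.031])
print(five_number_summary(psd_in / psd_fi))
```

Because a single near-zero PSD in the denominator inflates a ratio enormously, the maximum is much less stable than the quartiles, which is consistent with the outlying maxima noted in the text.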


In Table 4 we study the estimate of $\theta$ for the two
pertinent models IN and UN. For both models the coverage
probabilities are smaller than the nominal value, and the
coverage for UN is smaller than that for IN. Bias is
small for both models, positive for IN and negative for UN.
Except for $\ell = 36$, IN has by far the larger AB. The PSDs
and RMSEs are generally smaller for IN, and the widths of
the 95% credible intervals are significantly smaller for IN. It
appears that it is difficult to estimate $\theta$ under UN, but IN
appears to be somewhat better.

In Table 5 we present more detailed results (i.e., by
domain) for the case when the number of domains is 12. To
show further gains in precision, we have reduced the sample
sizes to half as much [i.e., we drew the sample sizes uniformly
in the interval (12, 75)]. We present the posterior
standard deviation and the posterior root mean square error,
averaged over the simulation runs. Again the standard errors
are presented. We note that all the probability contents (not
presented) are at least the nominal value of 95%. The
numerical standard errors are small in all cases. The PSDs
and RMSEs are in the right order. Note that because the
sample sizes are arranged in order from smallest to largest,
there is a decrease in the PSDs and RMSEs as the domain
numbers go up.


Table 4
Simulation: Comparison of the informative (IN) and the uniform (UN) models using posterior mean (PM), coverage (C), bias and
average absolute bias (B and AB), posterior standard deviation (PSD), root mean squared error (RMSE) and width of the 95%
credible intervals (W) of $\theta$

$\ell$  Model  PM              C               B                AB                 PSD             RMSE            W
12      IN     0.149 (0.0012)  0.853 (0.0112)   0.000 (0.0003)  0.00152 (0.00081)  0.008 (0.0000)  0.012 (0.0002)  0.030 (0.0001)
        UN     0.138 (0.0005)  0.881 (0.0102)  -0.012 (0.0004)  0.00038 (0.00003)  0.011 (0.0001)  0.016 (0.0002)  0.042 (0.0002)

24      IN     0.153 (0.0015)  0.833 (0.0118)   0.003 (0.0015)  0.00212 (0.00103)  0.007 (0.0006)  0.012 (0.0015)  0.024 (0.0015)
        UN     0.145 (0.0029)  0.842 (0.0115)  -0.005 (0.0003)  0.00012 (0.00006)  0.008 (0.0001)  0.012 (0.0002)  0.030 (0.0002)

36      IN     0.150 (0.0002)  0.828 (0.0119)   0.000 (0.0002)  0.00004 (0.00000)  0.004 (0.0000)  0.007 (0.0001)  0.017 (0.0001)
        UN     0.145 (0.0003)  0.794 (0.0128)  -0.005 (0.0002)  0.00009 (0.00000)  0.006 (0.0000)  0.010 (0.0001)  0.024 (0.0001)

Note: The two models considered are: Model 3 - informative prior for $\theta$ and Model 4 - uniform prior for $\theta$.
$\mathrm{RMSE} = \sqrt{(\theta_0 - \mathrm{PM})^2 + \mathrm{PSD}^2}$. Each entry is an estimate followed by its standard error in parentheses.


Table 5
Simulation: Comparison of the four models using posterior standard deviation and root mean square error (RMSE) of $\pi_i$ by
domain (D)

        Unrestricted                      Fixed
 D      PSD             RMSE              PSD             RMSE
 1      0.048 (0.0003)  0.057 (0.0004)    0.046 (0.0003)  0.054 (0.0004)
 2      0.046 (0.0003)  0.055 (0.0004)    0.044 (0.0003)  0.053 (0.0004)
 3      0.044 (0.0002)  0.053 (0.0004)    0.042 (0.0002)  0.050 (0.0004)
 4      0.042 (0.0002)  0.050 (0.0004)    0.040 (0.0002)  0.047 (0.0004)
 5      0.041 (0.0002)  0.049 (0.0004)    0.038 (0.0002)  0.046 (0.0004)
 6      0.040 (0.0002)  0.048 (0.0004)    0.037 (0.0002)  0.045 (0.0004)
 7      0.038 (0.0002)  0.046 (0.0004)    0.035 (0.0002)  0.043 (0.0003)
 8      0.037 (0.0002)  0.045 (0.0003)    0.034 (0.0002)  0.041 (0.0003)
 9      0.036 (0.0002)  0.044 (0.0003)    0.033 (0.0002)  0.040 (0.0004)
10      0.035 (0.0002)  0.043 (0.0003)    0.032 (0.0002)  0.039 (0.0003)
11      0.034 (0.0001)  0.042 (0.0003)    0.031 (0.0002)  0.038 (0.0003)
12      0.035 (0.0002)  0.047 (0.0005)    0.031 (0.0002)  0.042 (0.0004)

        Informative                       Uniform
 D      PSD             RMSE              PSD             RMSE
 1      0.045 (0.0002)  0.056 (0.0005)    0.047 (0.0004)  0.056 (0.0005)
 2      0.044 (0.0002)  0.054 (0.0005)    0.045 (0.0004)  0.054 (0.0005)
 3      0.042 (0.0002)  0.052 (0.0005)    0.043 (0.0003)  0.051 (0.0004)
 4      0.040 (0.0002)  0.050 (0.0004)    0.041 (0.0002)  0.049 (0.0004)
 5      0.039 (0.0002)  0.048 (0.0004)    0.039 (0.0003)  0.048 (0.0005)
 6      0.037 (0.0002)  0.048 (0.0004)    0.038 (0.0003)  0.047 (0.0005)
 7      0.036 (0.0002)  0.046 (0.0004)    0.037 (0.0003)  0.045 (0.0004)
 8      0.036 (0.0002)  0.046 (0.0004)    0.036 (0.0003)  0.044 (0.0004)
 9      0.034 (0.0001)  0.044 (0.0004)    0.035 (0.0003)  0.042 (0.0004)
10      0.034 (0.0001)  0.044 (0.0004)    0.034 (0.0003)  0.042 (0.0004)
11      0.033 (0.0001)  0.042 (0.0004)    0.033 (0.0003)  0.041 (0.0004)
12      0.034 (0.0003)  0.047 (0.0006)    0.034 (0.0007)  0.046 (0.0008)

Note: The four models are: Model 1 - no restriction; Model 2 - fixed $\theta$; Model 3 - informative prior for $\theta$; Model 4 - uniform prior for
$\theta$. $\mathrm{RMSE} = \sqrt{(\pi_i - \mathrm{PM})^2 + \mathrm{PSD}^2}$. Each entry is an estimate followed by its standard error in parentheses. Here 12 domains are used
and the original simulated sample sizes are divided by 2.
We study the posterior density of $\pi_1$ for $\ell = 12$, and we
compare the four models. Again we use the Parzen-Rosenblatt
density estimator. In Figure 2 we present the
estimated posterior densities (Parzen-Rosenblatt) averaged
over the 1,000 runs for $\ell = 12$. We obtain the same results
as for the BMI data. Again the tails are similar. FI is the
tallest density and UN is the shortest. UN is slightly shifted
to the left of IN. In Figure 3 we present a systematic sample
of 10 densities from the 1,000 simulation runs by model.
We can see large variation among the 10 estimated posterior
densities. Again we can see that FI is tallest; UR, FI and UN
show similar variation with IN slightly taller. Thus, it is
important to take the average for comparison as in Figure 2.


4. Concluding remarks


We have extended the beta-binomial model of small area 
estimation to accommodate a prior specification of a 
weighted average of the area probabilities. We have used 
the Bayesian approach which is particularly attractive for 
problems with awkward likelihood functions as in our 
application with the constraint of the weighted average of 
the beta-binomial model. We viewed the constraint as prior 
knowledge which can be precise or less informative. The 
griddy Gibbs sampler is used to fit the models, thereby 
avoiding the more sophisticated Metropolis-Hastings sam- 
pler. We have developed a theory which permits sampling 
from a density function which is proportional to the product 
of a truncated beta-binomial density and a generalized beta 
density. We have found that overall our complete algorithm 
forming the griddy Gibbs sampler runs efficiently and fast. 
Figure 2 Plots of the estimated posterior densities of $\pi_1$ by model when there are 12 domains

Figure 3 Plots of the estimated posterior densities of $\pi_1$ for a systematic sample of size 10 from the 1,000 runs by model when
there are 12 domains


We have shown that there could be gains in precision
when extra information is incorporated into the beta-binomial
model. We have considered three scenarios in which a
survey practitioner (a) cannot specify any constraint (standard
beta-binomial model for small areas), (b) can specify a
constraint and the parameter completely, and (c) can specify
a constraint and information which can be used to construct
a prior distribution for the parameter. Our example on
obesity of children in the National Health and Nutrition
Examination Survey and simulation study showed that the
gain in precision beyond (a) is in an order with (b) larger
than (c). As the exact algebraic arguments are difficult, we
obtained an analytical approximation which shows that
indeed there could be gain in precision of (b) over (a). For
comparison we have considered a fourth scenario in which
$\theta$ has vague information, and as expected, it turned out to be
rather uninteresting and inefficient.
It is straightforward to make Bayesian predictive inference
about the finite population mean of each small area.
Let $P_i = T_i/N_i$ denote the finite population proportion for
the $i$th area, where $T_i = \sum_{j=1}^{N_i} y_{ij}$, the $y_{ij}$ are the binary
responses, and $N_i$, the number of individuals in the $i$th area,
is assumed known. Now $T_i = t_i^{(s)} + t_i^{(ns)}$, where $t_i^{(s)}$ and
$t_i^{(ns)}$ are respectively the sample total and the nonsample total.
Now under any of the models
$t_i^{(ns)} \mid \pi_i \sim \text{Binomial}(N_i - n_i, \pi_i)$ and
$p(t_i^{(ns)} \mid y_s) = \int p(t_i^{(ns)} \mid \pi_i)\, p(\pi_i \mid y_s)\, d\pi_i$, where
$y_s = (y_1, \ldots, y_n)'$. Thus, it is easy to obtain the empirical
posterior density of $P_i$ using a sampling-based method.
Nandram and Sedransk (1993) obtained some analytical
features of $P_i$ when $\tau$ is known, but not with the constraint;
see also Nandram (1998).
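The sampling-based method mentioned above can be sketched as follows, assuming posterior draws of $\pi_i$ are already available (for example from the Gibbs sampler). The function name `predictive_proportion` is hypothetical, and the beta draws in the usage lines are only a stand-in for real posterior output.

```python
import numpy as np

def predictive_proportion(pi_draws, t_sample, n_i, N_i, rng=None):
    """Posterior predictive draws of P_i = (t_sample + t_nonsample) / N_i."""
    rng = np.random.default_rng() if rng is None else rng
    pi_draws = np.asarray(pi_draws, dtype=float)
    # One nonsample total per posterior draw: Binomial(N_i - n_i, pi_i).
    t_nonsample = rng.binomial(N_i - n_i, pi_draws)
    return (t_sample + t_nonsample) / N_i

rng = np.random.default_rng(1)
pi_draws = rng.beta(5.0, 40.0, size=2000)   # stand-in for posterior draws of pi_i
P = predictive_proportion(pi_draws, t_sample=4, n_i=47, N_i=1000, rng=rng)
print(P.mean(), np.percentile(P, [2.5, 97.5]))
```

When $N_i = n_i$ there is no nonsample part, and every draw of $P_i$ equals the observed sample proportion.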

We mention a generalization of our restricted beta-binomial
hierarchical Bayesian model to the Dirichlet-multinomial
model (e.g., Nandram 1998). Let $y_i$ be the $c$-vector of
cell counts (i.e., number of people possessing one of $c$
traits), and let $n_i$ denote the sample sizes within the $i$th
area, $i = 1, \ldots, \ell$. We assume

$$y_i \mid \pi_i \overset{ind}{\sim} \text{Multinomial}(n_i, \pi_i), \qquad \pi_i \mid \mu, \tau, \theta \overset{iid}{\sim} \text{Dirichlet}(\mu\tau)$$

with $\sum_{i=1}^{\ell} \omega_i \pi_i = \theta$. Finally $\theta \sim \text{Dirichlet}(\mu_0 \tau_0)$, where
$\mu_0$ and $\tau_0$ are to be specified, and $p(\mu, \tau) = (k-1)!/(1+\tau)^2$,
$0 < \mu_k < 1$, $k = 1, \ldots, c$, $\sum_{k=1}^{c} \mu_k = 1$. With
$k$ constraints this problem is more complex, but we
plan to work on it. Other extensions to nonignorable nonresponse
(Nandram and Choi 2002) and two-way categorical
tables are possible.
Appendix A
Proofs of lemmas 1, 2 and theorems 1, 2


Proof of lemma 1


This is a special case of a general result. Using the
multiplication rule and because the prior is proper, it is clear
that the joint density of $\pi, \mu, \tau, s$ "integrates" to one.
Therefore, the joint posterior density of $\pi, \mu, \tau$ given $s$ is
proper.


Proof of theorem 1 


Peteiins (TIT, OO Tal, aml, af, 0. U< 
ip ald Ow ep}, 0 <10 <1, ot) Ae 
Ps Or @,) Yand (T= 4 (ae, Leas goes. 1 
ee, 0 etl els note that eer: 

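Let $g(\pi, \mu, \tau \mid s)$ denote the right-hand side of the
unrestricted posterior density in (7) and
$p(\pi_{(\ell)}, \mu, \tau, \theta \mid s, \phi = \theta)$ denote the right-hand side of the
restricted posterior density in (9). Noting that
$\pi_\ell = (\theta - \sum_{i=1}^{\ell-1} \omega_i \pi_i)/\omega_\ell$, we observe that

$$p(\pi_{(\ell)}, \mu, \tau, \theta \mid s, \phi = \theta) = g(\pi, \mu, \tau \mid s) \times \theta^{\mu_0\tau_0 - 1}(1-\theta)^{(1-\mu_0)\tau_0 - 1}, \qquad (\pi_{(\ell)}, \theta) \in \Gamma.$$

Because $\theta^{\mu_0\tau_0 - 1}(1-\theta)^{(1-\mu_0)\tau_0 - 1}$ is proportional to the
density function of a beta random variable, we have

$$\int_{\Gamma} p(\pi_{(\ell)}, \mu, \tau, \theta \mid s, \phi = \theta)\, d\pi_{(\ell)}\, d\mu\, d\tau\, d\theta = A \int g(\pi, \mu, \tau \mid s)\, d\pi\, d\mu\, d\tau \le A \int g(\pi, \mu, \tau \mid s)\, d\pi\, d\mu\, d\tau,$$

where the first integral on the right is taken over the constrained
region, the second over the whole unrestricted parameter space, and
$A = B\{\mu_0\tau_0, (1-\mu_0)\tau_0\}$ is the beta function. By
lemma 1, the last integral is finite. Thus,
$p(\pi_{(\ell)}, \mu, \tau, \theta \mid s, \phi = \theta)$ is proper.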


Proof of Lemma 2 (a)


This can be proved in two ways. The second derivative
of $\log\{f_a(x)\}$ is negative in $(c, d)$, and so the first
derivative, when set to zero, provides a unique mode which
is $\delta d + (1-\delta)c$. Alternatively, because $(X - c)/(d - c) \sim
\text{Beta}(a, b)$ with $a, b > 1$, there is a unique mode for
$(X - c)/(d - c)$, and this translates to $\delta d + (1-\delta)c$;
note that $\delta d + (1-\delta)c$ is a point in $(c, d)$. Thus,
substituting $\delta d + (1-\delta)c$ into $f_a(x)$, we have

$$\sup_{c<x<d} f_a(x) = \frac{\delta^{a-1}(1-\delta)^{b-1}}{(d-c)\,B(a,b)}.$$


Proof of Lemma 2 (b)


Because $a, b > 1$, $x > x - c$ and $1 - x > d - x$, it is
true that

$$A^{-1} > D^{-1} \int_c^d (x-c)^{a+g-2}(d-x)^{b+h-2}\, dx,$$

where $D = (d-c)^{a+b-1} B(a,b)\, B(g,h)\{F_{g,h}(d) - F_{g,h}(c)\}$
and $F_{g,h}(x)$ is the cdf of a standard beta random variable in
$(0, 1)$. Note that because $c < d$ and $F_{g,h}(x)$ is
monotone increasing in $(0, 1)$, $F_{g,h}(d) - F_{g,h}(c) > 0$
(strictly). By comparison with the generalized beta density
[i.e., $\text{Beta}(a+g-1, b+h-1, c, d)$], the integral is
$(d-c)^{a+b+g+h-3} B(a+g-1, b+h-1)$. Thus

$$A^{-1} > \frac{(d-c)^{g+h-2}\, B(a+g-1, b+h-1)}{B(a,b)\,B(g,h)\{F_{g,h}(d) - F_{g,h}(c)\}} > 0.$$

Also, we have

$$A^{-1} \le \int_c^d f_b(x) \sup_{c<x<d} f_a(x)\, dx.$$

Then by Lemma 2 (a),

$$A^{-1} \le \frac{\delta^{a-1}(1-\delta)^{b-1}}{(d-c)\,B(a,b)} < \infty.$$

Proof of theorem 2


To show the claim, we calculate the cdf $F_X(\cdot)$ of the
random variable $X$ defined in the Theorem. We have

$$\begin{aligned}
F_X(x) &= P(X \le x) \\
&= P[F_{g,h}^{-1}\{U F_{g,h}(d) + (1-U) F_{g,h}(c)\} \le x] \\
&= P[U F_{g,h}(d) + (1-U) F_{g,h}(c) \le F_{g,h}(x)] \\
&= P[U\{F_{g,h}(d) - F_{g,h}(c)\} \le F_{g,h}(x) - F_{g,h}(c)] \\
&= P\left[U \le \frac{F_{g,h}(x) - F_{g,h}(c)}{F_{g,h}(d) - F_{g,h}(c)}\right].
\end{aligned}$$

Now, since $U \sim \text{Uniform}(0, 1)$, from the above expression
for $F_X(\cdot)$, we have $F_X(x) = 1$ if $x \ge d$ and $F_X(x) = 0$
if $x \le c$. When $c < x < d$, we have

$$F_X(x) = \frac{F_{g,h}(x) - F_{g,h}(c)}{F_{g,h}(d) - F_{g,h}(c)}.$$

This shows that $X$ has the truncated beta density $f_b(x)$
in (20).

Now, looking to use the accept-reject algorithm, consider

$$\frac{f(x)}{f_b(x)} = A f_a(x).$$

By Lemma 2, we have

$$\sup_{c<x<d} \frac{f(x)}{f_b(x)} = A \sup_{c<x<d} f_a(x) = \frac{A\,\delta^{a-1}(1-\delta)^{b-1}}{(d-c)\,B(a,b)}.$$

Thus, by the accept-reject algorithm, if an independent
$V \sim \text{Uniform}(0, 1)$ satisfies

$$V \le \frac{1}{\delta^{a-1}(1-\delta)^{b-1}} \left(\frac{X-c}{d-c}\right)^{a-1} \left(\frac{d-X}{d-c}\right)^{b-1},$$

then $X$ has the density $f(x)$ in (19).
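The two steps in this proof (an inverse-cdf draw from the truncated beta density, followed by an accept-reject step bounded via Lemma 2) can be sketched as below. This is an illustration under assumptions, not the authors' code: it takes $f_a$ to be the generalized beta density with parameters $(a, b)$ on $(c, d)$, $f_b$ to be the Beta$(g, h)$ density truncated to $(c, d)$, and $\delta = (a-1)/(a+b-2)$; the function name is hypothetical, and SciPy's `scipy.stats.beta` supplies the cdf and its inverse.

```python
import numpy as np
from scipy.stats import beta as beta_dist

def sample_product_density(a, b, g, h, c, d, size, rng=None):
    """Draw from f(x) proportional to f_a(x) * f_b(x) on (c, d); needs a, b > 1."""
    rng = np.random.default_rng() if rng is None else rng
    delta = (a - 1.0) / (a + b - 2.0)          # ASSUMED: mode of Beta(a, b) on (0, 1)
    log_peak = (a - 1.0) * np.log(delta) + (b - 1.0) * np.log1p(-delta)
    F_c, F_d = beta_dist.cdf(c, g, h), beta_dist.cdf(d, g, h)
    kept, n_kept = [], 0
    while n_kept < size:
        u = rng.uniform(size=size)
        # Inverse-cdf draw from Beta(g, h) truncated to (c, d).
        x = beta_dist.ppf(u * F_d + (1.0 - u) * F_c, g, h)
        z = np.clip((x - c) / (d - c), 1e-12, 1.0 - 1e-12)
        # Accept with probability f_a(x) / sup f_a, computed on the log scale.
        log_ratio = (a - 1.0) * np.log(z) + (b - 1.0) * np.log1p(-z) - log_peak
        v = rng.uniform(size=size)
        accepted = x[np.log(v) <= log_ratio]
        kept.append(accepted)
        n_kept += accepted.size
    return np.concatenate(kept)[:size]

draws = sample_product_density(3.0, 4.0, 2.0, 5.0, 0.1, 0.6, size=1000,
                               rng=np.random.default_rng(2))
print(draws.min(), draws.mean(), draws.max())
```

Working on the log scale avoids underflow when $a$ or $b$ is large; the normalizing constant $A$ cancels in the acceptance ratio, so it never needs to be computed.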


References 


Cochran, W.G. (1977). Sampling Techniques, third edition. New 
York: John Wiley & Sons, Inc. 


Gilks, W.R., and Wild, P. (1992). Adaptive rejection sampling for
Gibbs sampling. Journal of the Royal Statistical Society, Series C,
41, 337-348.


Gelman, A. (2006). Prior distribution for variance parameters in 
hierarchical models. Bayesian Analysis, 1, 515-533. 


Ghosh, M., Natarajan, K., Stroud, T.W.F. and Carlin, B.P. (1998).
Generalized linear models for small-area estimation. Journal of
the American Statistical Association, 93, 273-282.


Hillmer, S.C., and Trabelsi, A. (1987). Benchmarking of economic 
time series. Journal of the American Statistical Association, 82, 
1064-1071. 


Lazar, R., Meeden, G. and Nelson, D. (2008). A noninformative 
Bayesian approach to finite population sampling using auxiliary 
variables. Survey Methodology, 34, 51-64. 


Nandram, B. (1998). A Bayesian analysis of the three-stage 
hierarchical multinomial model. Journal of Statistical 
Computation and Simulation, 61, 97-126. 


Nandram, B., and Choi, J.W. (2002). Hierarchical Bayesian 
nonresponse models for binary data from small areas with 
uncertainty about ignorability. Journal of the American Statistical 
Association, 97, 381-388. 


Nandram, B., and Choi, J.W. (2002). A Bayesian analysis of a 
proportion under non-ignorable non-response. Statistics in 
Medicine, 21, 9, 1189-1212. 


Nandram, B., and Sedransk, J. (1993). Bayesian predictive inference 
for a finite population proportion: Two-stage cluster sampling. 
Journal of the Royal Statistical Society, Series B, 55, 399-408. 


Nandram, B., Toto, M.C.S. and Choi, J.W. (2011). A Bayesian 
benchmarking of the Scott-Smith model for small areas. Journal 
of Statistical Computation and Simulation (in press, preprint). 


Rao, J.N.K. (2003). Small Area Estimation. New York: John Wiley & 
Sons, Inc. 


Ritter, C., and Tanner, M.A. (1992). The Gibbs sampler and the griddy
Gibbs sampler. Journal of the American Statistical Association, 87,
861-868.


Robert, C.P., and Casella, G. (1999). Monte Carlo Statistical 
Methods. New York: Springer-Verlag. 


Silvapulle, M.J., and Sen, P.K. (2006). Constrained Statistical 
Inference: Inequality, Order and Shape Restrictions. New York: 
John Wiley & Sons, Inc. 


Silverman, B.W. (1986). Density Estimation. London: Chapman and 
Hall. 


Survey Methodology, December 2011 
Vol. 37, No. 2, pp. 153-170 
Statistics Canada, Catalogue No. 12-001-X 




On bias-robust mean squared error 
estimation for pseudo-linear small area estimators 


Ray Chambers, Hukum Chandra and Nikos Tzavidis 1


Abstract 


We propose a method of mean squared error (MSE) estimation for estimators of finite population domain means that can be 
expressed in pseudo-linear form, i.e., as weighted sums of sample values. In particular, it can be used for estimating the 
MSE of the empirical best linear unbiased predictor, the model-based direct estimator and the M-quantile predictor. The 
proposed method represents an extension of the ideas in Royall and Cumberland (1978) and leads to MSE estimators that 
are simpler to implement, and potentially more bias-robust, than those suggested in the small area literature. However, it 
should be noted that the MSE estimators defined using this method can also exhibit large variability when the area-specific 
sample sizes are very small. We illustrate the performance of the method through extensive model-based and design-based 
simulation, with the latter based on two realistic survey data sets containing small area information. 


Key Words: Best linear unbiased prediction; M-quantile model; Model-based direct estimation; Random effects 


model; Small area estimation. 


1. Introduction 


Linear models, and linear predictors based on these 
models, are widely used in survey-based inference. Howev- 
er, such models run the risk of misspecification, particularly 
with regard to second order and higher moments. Bias- 
robust methods for estimating the mean squared error 
(MSE) of linear predictors of finite population quantities, 
i.e., methods that remain approximately unbiased under 
failure of assumptions about second order and higher mo- 
ments, have been developed. Valliant, Dorfman and Royall 
(2000, Chapter 5) discuss bias-robust MSE estimation for 
such predictors when a population is assumed to follow a 
linear model. 

In this paper we address a subsidiary problem, which is 
that of bias-robust MSE estimation for estimators of finite 
population domain means that can be expressed in pseudo- 
linear form, i.e., as weighted sums, but where the weights 
can depend on the sample values of the variable of interest. 
An important application, and one that motivates our 
approach, is small area inference. Consequently from now 
on we use ‘area’ to refer to a domain of interest. Our ap- 
proach represents an extension of the ideas in Royall and 
Cumberland (1978) and appears to lead to simpler to 
implement MSE estimators than those that have been 
suggested in the small area literature. 

The structure of the paper is as follows. In section 2 we 
discuss MSE estimation under an area-specific linear model. 
That is, we focus on estimation of the conditional MSE. We 
then show how our approach can be used for estimating the 
MSE of three different small area linear predictors when 


they are expressed in pseudo-linear form, (a) the empirical 
best linear unbiased predictor or EBLUP (Henderson 1953); 
(b) the model-based direct estimator (MBDE) of Chandra 
and Chambers (2009); and (c) the M-quantile predictor 
(Chambers and Tzavidis 2006). In section 3 we present 
results from a series of simulation studies that illustrate the 
model-based and the design-based properties of our 
approach to MSE estimation. Finally, in section 4 we sum- 
marize our main findings. Throughout, we use either i or h 
to index the D small areas of interest, and either j or k to 
index the distinct population units in these areas. 


2. Bias-robust MSE estimation for 
pseudo-linear estimators 


2.1 MSE estimation under an area-specific linear 
model 


We consider the situation where we have a finite 
population of size N from which a sample of size n is drawn. 
We assume that this population consists of D non-over- 
lapping domains, each one of which contains sampled units, 
with small realised sample sizes in each of the sampled 
domains. As noted earlier, and following standard practice, 
we refer to these domains as areas from now on. We assume 
also that there is a known number $N_i$ of population units in
area $i$, with $n_i$ of these sampled. The total number of units
in the population is $N = \sum_{i=1}^{D} N_i$, with corresponding total
sample size $n = \sum_{i=1}^{D} n_i$. In what follows, we use $s$ to denote
the collection of units in sample, with $s_i$ the subset drawn
from area $i$, and use expressions like $j \in i$ and $j \in s$ to
refer to the units making up area $i$ and sample $s$ respectively.
Linear models are often used to motivate estimators for
population means. However, when estimates are required 
for the corresponding area means, it is usually not realistic 
to assume that a linear model that applies to the population 
as a whole also applies within each area. We therefore adopt 
a conditional approach, and consider MSE estimation for 
estimators of area means when different linear models apply 
within different areas. In particular, we focus on estimators 
that can be expressed as weighted sums of the sample 
values, referring to them as ‘linear’ in what follows to indi- 
cate that they have a linear structure. 

To start, let $y_j$ denote the value of $Y$ for unit $j$ of the
population and suppose that this unit is in area $i$. We also
assume an area-specific linear model for $y_j$ of the form

$$y_j = x_j^{T}\beta_i + e_j. \qquad (1)$$

Here $x_j$ is a $p \times 1$ vector of unit level auxiliary variables
for unit $j$, $\beta_i$ is a $p \times 1$ vector of area-specific regression
coefficients and $e_j$ is a unit level random effect with mean
zero and variance $\sigma_j^2$ that is uncorrelated between different
population units. We do not make any assumptions about
$\sigma_j^2$ at this point. Note that throughout this paper we assume
that the sampling method used is non-informative for the
population values of $Y$ given the corresponding values of the
auxiliary variables and knowledge of the area affiliations of
the population units. As a consequence, (1) applies at both
sample and population level.
Let y, denote the column vector of sample values of y, 
and let w, = {w,; j € 5} mate the column vector of 
fixed weights aa, re fw eye: w,,y; 18 a linear 
estimator of m, =N;' Djciy; By ‘fixed’ here we mean that 
these weights do not depend on the sample values of Y. 
Moreover, we assume w, = O(n,') for jes,,w, = = 0(n;') 
for j¢s,, and > jes W, = re Here s, denotes the n, sample 
units from area i. The bias of ™, under (1) is then 

E(m, — m;) = [alee Se 77 B, } mae WAZ) 
where xX, denotes the vector of average values of the 
auxiliary variables in area i. Similarly, the prediction 
variance of m, under (1) is 


=) D D2 2 
N; eae Gy + wie 5, (3) 
where 7, 


- denotes the non-sampled units in area i and 
a, =N,w, —I(j¢i). We use I(A) to denote the indicator 
function iat event A, so /(j € i) takes the value | if popu- 
lation unit j is from area i and is zero otherwise. Note that 
since a, is O(N; n,') for j € 8, the first term within the 
braces 1 i (3) is the leading term of this prediction variance if 
N, is large compared to 7,. 


Var(m, — m,) = 
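Let $j \in h$. We consider the important special case
where $\mu_j = E(y_j | x_j) = x_j^T \beta_h$ is estimated by
$\hat\mu_j = \sum_{k \in s} \phi_{kj} y_k$, with the $\phi_{kj}$ corresponding to
fixed weights. Then

$$y_j - \hat\mu_j = (1 - \phi_{jj}) y_j - \sum_{k \in s(-j)} \phi_{kj} y_k$$

and so

$$Var(y_j - \hat\mu_j) = \sigma_j^2 \Big\{ (1 - \phi_{jj})^2 + \sum_{k \in s(-j)} \phi_{kj}^2 (\sigma_k^2 / \sigma_j^2) \Big\}$$

under (1). Here s(−j) denotes the sample s with unit j
excluded. If in addition $\hat\mu_j$ is unbiased for $\mu_j$ under (1), i.e.,

$$E(y_j - \hat\mu_j) = 0, \qquad (5)$$

we can then adopt the approach of Royall and Cumberland
(1978) and estimate (3) by

$$\hat V(\hat m_i) = N_i^{-2} \Big\{ \sum_{j \in s} a_{ij}^2 \hat\lambda_j^{-1} (y_j - \hat\mu_j)^2 + \sum_{j \in r_i} \hat\sigma_j^2 \Big\}, \qquad (6)$$

where $\hat\lambda_j = (1 - \phi_{jj})^2 + \sum_{k \in s(-j)} \phi_{kj}^2 (\hat\sigma_k^2 / \hat\sigma_j^2)$.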

Usually, the estimates $\hat\sigma_j^2$ of the residual variances in (6)
are derived under a ‘working model’ refinement to (1). In
the situation of most concern to us, where the sample sizes
within the different areas are too small to reliably estimate
area-specific variability, a pooling assumption can be made,
i.e., $\sigma_j^2 = \sigma^2$, in which case we put

$$\hat\sigma_j^2 = \hat\sigma^2 = n^{-1} \sum_{j \in s} \lambda_j^{-1} (y_j - \hat\mu_j)^2.$$

In this case (6) becomes

$$\hat V(\hat m_i) = N_i^{-2} \sum_{j \in s} \{ a_{ij}^2 + (N_i - n_i) n^{-1} \} \lambda_j^{-1} (y_j - \hat\mu_j)^2, \qquad (7)$$

where now $\lambda_j = (1 - \phi_{jj})^2 + \sum_{k \in s(-j)} \phi_{kj}^2$. Since any
assumptions regarding $\sigma_j^2$ in the working model extension of
(1) only affect second order terms in (3), the estimator (7) is
bias-robust, i.e., it remains approximately unbiased under
misspecification of the second order moments of this
working model.

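A corresponding estimator of the MSE of $\hat m_i$ under (1)
follows directly. This is

$$\hat M(\hat m_i) = \hat V(\hat m_i) + \hat B^2(\hat m_i), \qquad (8)$$

where

$$\hat B(\hat m_i) = \sum_{j \in s} w_{ij} \hat\mu_j - N_i^{-1} \sum_{j \in i} \hat\mu_j \qquad (9)$$

is the obvious unbiased estimator of (2).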
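
The estimators (7), (8) and (9) can be sketched numerically as follows. This is a minimal illustration rather than the authors' code: the function name `conditional_mse`, its argument names, and the convention that `phi[k, j]` holds $\phi_{kj}$ are assumptions made for the example, and the area mean of the fitted values, $N_i^{-1} \sum_{j \in i} \hat\mu_j$, is assumed to be supplied by the caller as `mu_bar_i`.

```python
import numpy as np

def conditional_mse(y, phi, w_i, in_area_i, N_i, mu_bar_i):
    """Sketch of (7), (9) and (8) for a single area i under pooled variances.

    y         : (n,) sample values
    phi       : (n, n) array, phi[k, j] is the weight of y_k in mu_hat_j
    w_i       : (n,) fixed weights defining the area-i estimator
    in_area_i : (n,) boolean flags for sample units belonging to area i
    N_i       : population size of area i
    mu_bar_i  : area-i population mean of the fitted values (hypothetical input)
    """
    y = np.asarray(y, dtype=float)
    phi = np.asarray(phi, dtype=float)
    w_i = np.asarray(w_i, dtype=float)
    flag = np.asarray(in_area_i, dtype=bool)
    n, n_i = y.size, int(flag.sum())

    mu_hat = phi.T @ y                      # mu_hat_j = sum_k phi[k, j] * y_k
    d = np.diag(phi)
    # lambda_j = (1 - phi_jj)^2 + sum over k != j of phi_kj^2
    lam = (1.0 - d) ** 2 + (phi ** 2).sum(axis=0) - d ** 2

    a = N_i * w_i - flag.astype(float)      # a_ij = N_i * w_ij - I(j in i)
    v_hat = ((a ** 2 + (N_i - n_i) / n) * (y - mu_hat) ** 2 / lam).sum() / N_i ** 2
    b_hat = w_i @ mu_hat - mu_bar_i         # estimated bias, as in (9)
    return v_hat, b_hat, v_hat + b_hat ** 2  # (7), (9), (8)
```

Note that $\lambda_j$ is zero when a fitted value reproduces its own observation exactly (for example an area mean based on a single unit), so the sketch presumes at least two sample units wherever area means are used as fitted values.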

Use of the square of the unbiased estimator (9) of the bias
of $\hat m_i$ in the conditional MSE estimator (8) can be criticised
because this term is not itself unbiased for the squared bias
term in MSE. This can be corrected by replacing (8) by
$$\hat M(\hat m_i) = \hat V(\hat m_i) + \hat B^2(\hat m_i) - \hat V\{\hat B(\hat m_i)\}, \qquad (10)$$

where $\hat V\{\hat B(\hat m_i)\}$ is a suitable estimator of the variance of
(9). However, we do not recommend use of (10). To see
this, let $\bar\beta = D^{-1} \sum_{h=1}^{D} \hat\beta_h$ and put $d_h = \hat\beta_h - \bar\beta$, where $\hat\beta_h$
is the estimator of $\beta_h$ implied by the weights $\phi_{kj}$. Furthermore,
put $w_{hi} = \sum_{j \in s_h} w_{ij}$ and $\bar x_{whi} = w_{hi}^{-1} \sum_{j \in s_h} w_{ij} x_j$, so
$\bar x_{wi} = \sum_{h=1}^{D} \sum_{j \in s_h} w_{ij} x_j = \sum_{h=1}^{D} w_{hi} \bar x_{whi}$ is the estimate of $\bar x_i$
based on the weights $w_i$. Finally, let $\delta_{hi} = \bar x_{wi}^T d_h - \bar x_i^T d_i$
and put $\delta_i = \sum_{h=1}^{D} w_{hi} \delta_{hi}$. Then (9) can be written

$$\hat B(\hat m_i) = (\bar x_{wi} - \bar x_i)^T \bar\beta + \sum_{h=1}^{D} w_{hi} \bar x_{whi}^T d_h - \bar x_i^T d_i$$

$$= (\bar x_{wi} - \bar x_i)^T \bar\beta + \sum_{h=1}^{D} w_{hi} (\bar x_{whi} - \bar x_{wi})^T d_h + \sum_{h=1}^{D} w_{hi} \bar x_{wi}^T d_h - \bar x_i^T d_i$$

$$= (\bar x_{wi} - \bar x_i)^T \bar\beta + \sum_{h=1}^{D} w_{hi} (\bar x_{whi} - \bar x_{wi})^T d_h + \delta_i. \qquad (11)$$

Typically, D will be large and the leading term in the
variance of (9) will be the variance of $\delta_i$ in (11). If this
leading term is large, then $\hat V\{\hat B(\hat m_i)\}$ will also be large, and
(10) could take negative values. We therefore recommend
that (8), rather than (10), be used. An immediate consequence
is that (8) is then a conservative estimator of the
MSE of $\hat m_i$ under (1). This may be acceptable provided that
the variance of $\delta_i$ is small. However, for very small values
of $n_i$ this variance can be large, causing (8) to substantially
overestimate the actual MSE of $\hat m_i$. We therefore recommend
a preliminary empirical assessment of the size of the
variance of $\delta_i$ relative to the value of (7) in this situation. If
this assessment indicates that the variance of $\delta_i$ dominates
(7), then (8) should not be used.


2.2 MSE estimation for pseudo-linear small area 
estimators 


The approach to conditional MSE estimation outlined in
the previous sub-section assumed that the weights defining
the linear estimator $\hat m_i$ do not depend on the sample values
of Y. However, most small area estimators do not satisfy this
condition, in the sense that they are pseudo-linear in
structure, with weights that do depend on these sample
values. For example, the Best Linear Unbiased Predictor
(BLUP) of $m_i$ under the linear mixed model variant of (1)
where the area-specific regression parameters $\beta_i$ are
independent and identically distributed realisations of a
random variable with expected value $\beta$ and covariance
matrix $\Gamma$, can be written as a weighted sum of the sample
values of Y where the weights depend on $\Gamma$ (see Royall
1976). Consequently, the empirical version of this predictor,
the widely used EBLUP, is computed by substituting an
efficient sample estimate of $\Gamma$ (e.g., the REML estimate)
into the BLUP weights. If the linear mixed model
assumption is true, this sample estimator of $\Gamma$ converges to
the true value and consequently the EBLUP weights
converge to the BLUP weights. That is, for large values of
the overall sample size n, we can treat the EBLUP weights
as fixed and use the MSE estimator (8) for the EBLUP. Of
course, the EBLUP weights are not really fixed, and so (8)
is an approximation to the true MSE of the
EBLUP that ignores the contribution to this MSE arising
from the variability in estimation of $\Gamma$. However, this
potential underestimation needs to be balanced against the
bias robustness of (8) under misspecification of the second
order moments of Y.

An important advantage of (8) is that it can be used with 
a range of small area estimators that can be expressed in 
pseudo-linear form. In particular, many small area 
estimators developed under models that are variants of (1) 
can be written in this form, i.e., as weighted sums of the 
sample values of Y. To illustrate, we now focus on three 
such estimators: the EBLUP (Rao 2003, Chapter 6), the 
Model-Based Direct Estimator (MBDE) of Chandra and 
Chambers (2009) and the M-quantile predictor of Chambers 
and Tzavidis (2006). Each of these estimators can be written
in pseudo-linear form, with weights that satisfy $w_{ij} = O(n_i^{-1})$
for $j \in s_i$ and $w_{ij} = o(n_i^{-1})$ for $j \notin s_i$, so (8)
can be used.


2.2.1 MSE estimation for the EBLUP 


We first consider the well-known EBLUP for $m_i$ based
on a unit level linear mixed model extension of (1) of the
form

$$y_i = X_i \beta + Z_i u_i + e_i, \qquad (12)$$

where $y_i$ is the $N_i$-vector of population values of $y_j$ in
area i, $X_i$ is the corresponding $N_i \times p$ matrix of auxiliary
variable values $x_j$, $Z_i$ is the $N_i \times q$ component of $X_i$
corresponding to the q random components of $\beta$, $u_i$ is the
associated q-vector of area-specific random effects and $e_i$ is
the $N_i$-vector of individual random effects. It is typically
assumed that the area and individual effects are mutually
independent, with the area effects independently and
identically distributed as $N(0, \Omega)$ and the individual
effects independently and identically distributed as
$N(0, \sigma^2)$. See Rao (2003, Chapter 6) for development of
the underlying theory of this predictor. We note that the
EBLUP can be written in pseudo-linear form,

$$\hat m_i^{EBLUP} = \sum_{j \in s} w_{ij}^{EBLUP} y_j = (w_i^{EBLUP})^T y_s, \qquad (13)$$

where

$$w_i^{EBLUP} = (w_{ij}^{EBLUP}) = N_i^{-1} [\Delta_{is} + \{ \hat H^T X_r^T + (I_n - \hat H^T X_s^T) \hat\Sigma_{ss}^{-1} \hat\Sigma_{sr} \} \Delta_{ir}].$$

Here $\Delta_{ir}$ is the vector of size N − n that ‘picks out’ the
non-sampled units in area i (with $\Delta_{is}$ the corresponding
n-vector for the sampled units), $X_s$ and $X_r$ are the matrices of
order $n \times p$ and $(N - n) \times p$ respectively of the sample
and non-sample values of the auxiliary variables, $I_n$ is the
identity matrix of order n, $\hat H = (X_s^T \hat\Sigma_{ss}^{-1} X_s)^{-1} X_s^T \hat\Sigma_{ss}^{-1}$,
$\hat\Sigma_{ss} = \hat\sigma^2 I_n + diag\{Z_{is} \hat\Omega Z_{is}^T; i = 1, ..., D\}$ and
$\hat\Sigma_{sr} = diag\{Z_{is} \hat\Omega Z_{ir}^T; i = 1, ..., D\}$. Here $Z_{is}$ ($Z_{ir}$) is the sample (non-sample)
component of $Z_i$ and $\hat\sigma^2$ and $\hat\Omega$ are suitable (e.g., ML or
REML) estimates of the variance components of (12).

Given this setup, estimation of the conditional MSE of
the EBLUP can be carried out using (8) with weights
defined following (13). In turn, this requires that we have
access to unbiased estimators $\hat\mu_j$ of the area specific
individual expected values $\mu_j$. However, such estimators
may be unstable when area sample sizes are small. Consequently,
it is tempting to replace $\hat\mu_j$ by the EBLUP for $y_j$,
i.e., $\hat y_j^{EBLUP} = x_j^T \hat\beta^{EBLUE} + z_j^T \hat u_i^{EBLUP}$, where $\hat\beta^{EBLUE}$ denotes
the Empirical Best Linear Unbiased Estimator of $\beta$ in the
linear mixed model (12) and $\hat u_i^{EBLUP}$ denotes the predicted
area effect for the area i that contains observation j. Unfortunately,
because of the well-known shrinkage effect associated
with EBLUPs, this approach is not recommended. To
illustrate this, we note that $\hat V(\hat m_i)$ in (8) uses $(y_j - \hat\mu_j)^2$ as
an estimator of $E(y_j - \mu_j)^2$. The bias in this estimator is
therefore

$$E(y_j - \hat\mu_j)^2 - E(y_j - \mu_j)^2 = -2E\{y_j(\hat\mu_j - \mu_j)\} + E(\hat\mu_j^2 - \mu_j^2) = -E\{(\hat\mu_j - \mu_j)(2y_j - \hat\mu_j - \mu_j)\}$$

so we anticipate that $\hat V(\hat m_i)$ will be negatively biased if
$E\{(\hat\mu_j - \mu_j)(2y_j - \hat\mu_j - \mu_j)\}$ is positive and vice versa.
Now let sample unit j be from area i and consider the special
case of a random intercept model for $y_j$, i.e., $y_j = x_j^T \beta + u_i + e_j$,
where $u_i$ is the random effect for area i and
$e_j$ is a random individual effect uncorrelated with $u_i$. Here
$\mu_j = x_j^T \beta + u_i$. Suppose that we have a large overall
sample size, allowing us to replace $\hat\beta^{EBLUE}$ by $\beta$. The
EBLUP $\hat y_j^{EBLUP}$ can then be approximated by $\hat\mu_j = x_j^T \beta + \gamma_i u_i$,
where $\gamma_i$ is a ‘shrinkage’ factor. It follows
that

$$(\hat\mu_j - \mu_j)(2y_j - \hat\mu_j - \mu_j) = 2u_i(\gamma_i - 1)e_j - u_i^2(\gamma_i - 1)^2$$

so $E(y_j - \hat\mu_j)^2 - E(y_j - \mu_j)^2 \approx (\gamma_i - 1)^2 \sigma_u^2$. That is, we
expect $\hat V(\hat m_i)$ to be positively biased if we use the shrunken
EBLUP $\hat y_j^{EBLUP}$ to define $\hat\mu_j$. We also note that this bias
disappears (approximately) if we ‘unshrink’ the residual
component of this EBLUP. For example, in the case of the
popular random intercepts model, we use

$$\hat\mu_j = x_j^T \hat\beta^{EBLUE} + (\bar y_{is} - \bar x_{is}^T \hat\beta^{EBLUE}) = \bar y_{is} + (x_j - \bar x_{is})^T \hat\beta^{EBLUE},$$

where $\bar y_{is}$ and $\bar x_{is}$ denote the sample means of Y and X
respectively in area i. Given (12) is the working model, a
general expression for such an ‘unshrunken’ estimator is

$$\hat\mu_j = x_j^T \hat\beta^{EBLUE} + z_j^T \tilde u_i, \qquad (14)$$

where $\tilde u_i = (Z_{is}^T Z_{is})^{-1} Z_{is}^T (y_{is} - X_{is} \hat\beta^{EBLUE})$ is the unshrunken
predictor of the random effect for area i. It is not difficult to
see that then $\hat\mu_j = \sum_{k \in s} \phi_{kj} y_k$, where $\phi_{kj} = c_{kj} + b_{kj} I(k \in i)$,
with

$$c_j = (c_{kj}; k \in s) = \hat H^T \{ x_j - X_{is}^T Z_{is} (Z_{is}^T Z_{is})^{-1} z_j \}$$

and $b_j = (b_{kj}; k \in s_i) = Z_{is} (Z_{is}^T Z_{is})^{-1} z_j$. Note that these
$\phi_{kj}$'s are also used to calculate the value of $\lambda_j$ defined
immediately after (7).
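
For the random intercepts case, the ‘unshrunken’ fitted values described above are simple to compute once an estimate of the regression coefficients is available. The following is a minimal sketch only: the function name `unshrunken_fitted` and its arguments are hypothetical, and `beta_hat` is assumed to have been obtained elsewhere (for example from a mixed model fit).

```python
import numpy as np

def unshrunken_fitted(y, X, area, beta_hat):
    """mu_hat_j = ybar_is + (x_j - xbar_is)^T beta_hat for every sample unit j.

    y        : (n,) sample values
    X        : (n, p) sample auxiliary values (without an intercept column)
    area     : (n,) area label of each sample unit
    beta_hat : (p,) estimated regression coefficients (assumed given)
    """
    y = np.asarray(y, dtype=float)
    X = np.asarray(X, dtype=float)
    area = np.asarray(area)
    beta_hat = np.asarray(beta_hat, dtype=float)
    mu_hat = np.empty_like(y)
    for a in np.unique(area):
        idx = area == a
        # Area sample mean of y plus the within-area regression adjustment.
        mu_hat[idx] = y[idx].mean() + (X[idx] - X[idx].mean(axis=0)) @ beta_hat
    return mu_hat
```

By construction the residuals $y_j - \hat\mu_j$ sum to zero within each sampled area, which is the sense in which the area effect is not shrunk towards zero.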

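Finally, we observe that when (14) is used in (8), the
estimated bias (9) becomes

$$\hat B(\hat m_i) = \sum_{h=1}^{D} \Big( \sum_{j \in s_h} w_{ij}^{EBLUP} z_j^T \Big) \tilde u_h - \bar z_i^T \tilde u_i$$

since the EBLUP weights (13) are ‘locally calibrated’ on X,
i.e., $(w_i^{EBLUP})^T X_s = \bar x_i^T$. It follows that in this case the
variable $\delta_i$ defined immediately before (11) takes the form

$$\delta_i = \sum_{h=1}^{D} w_{hi}^{EBLUP} \bar z_{wi}^T \tilde u_h - \bar z_i^T \tilde u_i,$$

where $w_{hi}^{EBLUP} = \sum_{j \in s_h} w_{ij}^{EBLUP}$. For a large enough overall
sample size $\delta_i$ can be approximated by

$$\delta_i \approx \sum_{h=1}^{D} w_{hi}^{BLUP} \bar z_{wi}^T (Z_{hs}^T Z_{hs})^{-1} Z_{hs}^T (y_{hs} - X_{hs}\beta) - \bar z_i^T (Z_{is}^T Z_{is})^{-1} Z_{is}^T (y_{is} - X_{is}\beta)$$

$$\approx \sum_{h \ne i} w_{hi}^{BLUP} \bar z_i^T \{ u_h + (Z_{hs}^T Z_{hs})^{-1} Z_{hs}^T e_{hs} \}$$

where $w_{hi}^{BLUP}$ is the BLUP equivalent of $w_{hi}^{EBLUP}$. The
variance of $\delta_i$ can therefore be estimated via

$$\hat V(\delta_i) = \sum_{h \ne i} (w_{hi}^{EBLUP})^2 \bar z_i^T \{ \hat\Omega + \hat\sigma^2 (Z_{hs}^T Z_{hs})^{-1} \} \bar z_i. \qquad (15)$$

If $\hat V(\delta_i)$ is small relative to the value of (7) in this case,
then (8) can be used to estimate the MSE of the EBLUP.
However, when $n_i$ is very small, this condition may not
hold. In such cases it may be advisable to consider more
model-dependent MSE estimators like the Prasad-Rao (PR)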
MSE estimator (Prasad and Rao 1990; Rao 2003, section 
7.2.3). When a random means model is assumed, but the 
between area variability is very small relative to the within 
area variability, this advice extends to moderate area sample 
sizes as we now show. 


2.2.2 MSE estimation for the EBLUP under the 
random means model 


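The random means model is the special case of (12)
where $y_j = \beta + u_i + e_j$ with $u_i \sim N(0, \sigma_u^2)$ and
$e_j \sim N(0, \sigma^2)$. The EBLUE of $\beta$ is then $\hat\beta = \sum_{h=1}^{D} \hat\xi_h \bar y_{hs}$ with
$\hat\xi_h = (\hat\phi + n_h^{-1})^{-1} \{ \sum_{g=1}^{D} (\hat\phi + n_g^{-1})^{-1} \}^{-1}$ and $\hat\phi = \hat\sigma_u^2 / \hat\sigma^2$, and
the EBLUP (13) is defined by weights of the form

$$w_{ij}^{EBLUP} = (1 - f_i)(1 - \hat\gamma_i) \sum_{h=1}^{D} \hat\xi_h n_h^{-1} I(j \in s_h) + \{ f_i + (1 - f_i)\hat\gamma_i \} n_i^{-1} I(j \in s_i)$$

with $f_i = n_i N_i^{-1}$ and $\hat\gamma_i = n_i \hat\phi (1 + n_i \hat\phi)^{-1}$.
For $j \in h$, $\hat\mu_j = \sum_{k \in s} \phi_{kj} y_k = \bar y_{hs}$, and so

$$\lambda_j = (1 - n_h^{-1})^2 + (n_h - 1) n_h^{-2} = (n_h - 1) n_h^{-1}.$$

It follows that the estimator (7) of the conditional prediction
variance of $\hat m_i^{EBLUP}$ in this case is

$$\hat V(\hat m_i^{EBLUP}) = (1 - f_i)^2 \Big[ \sum_{h=1}^{D} \{ (1 - \hat\gamma_i)^2 \hat\xi_h^2 n_h^{-2} + (N_i - n_i)^{-1} n^{-1} \} n_h s_h^2 + n_i^{-1} \hat\gamma_i \{ 2(1 - \hat\gamma_i)\hat\xi_i + \hat\gamma_i \} s_i^2 \Big],$$

where $s_h^2 = (n_h - 1)^{-1} \sum_{j \in s_h} (y_j - \bar y_{hs})^2$, while from (9) the
estimator of the conditional prediction bias of $\hat m_i^{EBLUP}$ is
$\hat B(\hat m_i^{EBLUP}) = (1 - f_i)(1 - \hat\gamma_i)(\hat\beta - \bar y_{is})$. For $h \ne i$ we also
then have

$$w_{hi}^{EBLUP} = (1 - f_i)(1 - \hat\gamma_i)\hat\xi_h \approx \hat\xi_h (1 + n_i \hat\phi)^{-1}$$

when we ignore $O(N_i^{-1})$ terms. A similar approximation to
(15) therefore leads to

$$\hat V(\delta_i) = \sum_{h \ne i} (w_{hi}^{EBLUP})^2 (\hat\sigma_u^2 + n_h^{-1} \hat\sigma^2) \approx (1 + n_i \hat\phi)^{-2} \sum_{h \ne i} \hat\xi_h^2 (\hat\phi + n_h^{-1}) \hat\sigma^2.$$

Suppose now that the sample size in every sampled area is the
same, i.e., $n_h = m$. Then $n = mD$, $\hat\xi_h = D^{-1}$ and the
approximation to $\hat V(\delta_i)$ above takes the form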
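$$\hat V(\delta_i) \approx \hat\sigma^2 D^{-2} \sum_{h \ne i} \Big( \frac{1}{1 + m\hat\phi} \Big)^2 \Big( \frac{1 + m\hat\phi}{m} \Big) \approx n^{-1} (1 + m\hat\phi)^{-1} \hat\sigma^2$$

while the corresponding approximation to $\hat V(\hat m_i^{EBLUP})$ is

$$\hat V(\hat m_i^{EBLUP}) \approx (1 + m\hat\phi)^{-2} D^{-2} m^{-1} \sum_{h=1}^{D} s_h^2 + \hat\phi (1 + m\hat\phi)^{-2} (2D^{-1} + m\hat\phi) s_i^2$$

$$= n^{-1} (1 + m\hat\phi)^{-2} \Big\{ \Big( D^{-1} \sum_{h=1}^{D} s_h^2 \Big) + m\hat\phi (2 + n\hat\phi) s_i^2 \Big\}.$$

Comparing these approximations to $\hat V(\delta_i)$ and $\hat V(\hat m_i^{EBLUP})$
we see that if $m\hat\phi$ is small (e.g., when m and $\hat\phi$ are both
small) then these terms will be of similar magnitude. In this
situation we expect (8) to overestimate the true MSE of the
EBLUP. In particular, the approximation to (8) when $m\hat\phi$ is
small and $N_i$ is large is

$$\hat M(\hat m_i^{EBLUP}) \approx n^{-1} \Big( D^{-1} \sum_{h=1}^{D} s_h^2 \Big) + \Big( \bar y_{is} - D^{-1} \sum_{h=1}^{D} \bar y_{hs} \Big)^2. \qquad (16)$$

Note that the expectation of the squared residual on the right
hand side of (16) when $m\phi$ is small is $(1 - D^{-1})(\sigma_u^2 + m^{-1}\sigma^2) = O(1)$
and so it is the leading term in this estimator
in this situation. This expression can be compared
with the corresponding one for the MSE estimator of the
EBLUP suggested by Prasad and Rao (1990). Under the
random means model, the PR MSE estimator is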


EBLUP 
Mop (m,; )= 


= 1— f,)° jm ‘6 


+-4,)(myy. ay + N= fe? 


poole 


AGE Na) eo RIND 
aon Dish + 26°67 m 


where 7, =e + 6 


a 
ID) ja) [BY Bye) D A=) \e 
+( aa, peices \-(DRaati ie 


Assuming $n_h = m$, $m\hat\phi$ is small and $N_i$ is large,
$\hat M_{PR}(\hat m_i^{EBLUP})$ has the approximation

$$\hat M_{PR}(\hat m_i^{EBLUP}) \approx \hat\sigma^2 \{ n^{-1} + 2(n - D)^{-1} \} + \hat\sigma_u^2. \qquad (17)$$

Comparing (16) and (17) we can see that the instability and
the overestimation associated with the use of (8) in this
situation are both due to the use of the square of the single
degree of freedom area level residual $\bar y_{is} - D^{-1} \sum_{h=1}^{D} \bar y_{hs}$ as
an estimator of $\sigma_u^2$. This reinforces earlier comments that
(8) should not generally be used for estimating the MSE of
the EBLUP if the area sample sizes are very small or, in the
special case of the random means model, for moderate area
sample sizes when the between area variability is very small
relative to the within area variability.
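
To make the structure of the EBLUP weights under the random means model concrete, the sketch below evaluates them for one area and can be used to confirm that they sum to one and reproduce the usual composite form $f_i \bar y_{is} + (1 - f_i)\{\hat\beta + \hat\gamma_i(\bar y_{is} - \hat\beta)\}$. It is an illustration under stated assumptions: the function name and arguments are hypothetical, the variance ratio $\hat\phi = \hat\sigma_u^2 / \hat\sigma^2$ is taken as given rather than estimated, and $f_i$ is taken to be the area sampling fraction $n_i / N_i$.

```python
import numpy as np

def random_means_eblup_weights(area, i, N_i, phi_hat):
    """EBLUP weights w_ij for area i under the random means model.

    area    : (n,) area label of each sample unit
    i       : label of the target area (must be a sampled area)
    N_i     : population size of area i
    phi_hat : assumed value of sigma_u^2 / sigma^2
    """
    area = np.asarray(area)
    labels, counts = np.unique(area, return_counts=True)
    n_of = dict(zip(labels.tolist(), counts.tolist()))

    # xi_h is proportional to (phi + 1/n_h)^{-1}, normalised to sum to one.
    inv = 1.0 / (phi_hat + 1.0 / counts)
    xi_of = dict(zip(labels.tolist(), (inv / inv.sum()).tolist()))

    n_i = n_of[i]
    f_i = n_i / N_i
    gamma_i = n_i * phi_hat / (1.0 + n_i * phi_hat)

    # Contribution through beta_hat, spread over every sampled area ...
    w = np.array([(1 - f_i) * (1 - gamma_i) * xi_of[a] / n_of[a]
                  for a in area.tolist()])
    # ... plus the direct contribution of the area-i sample mean.
    w[area == i] += (f_i + (1 - f_i) * gamma_i) / n_i
    return w
```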


2.2.3. MSE estimation for the MBDE 


The second predictor of $m_i$ that we consider is the
Model-Based Direct Estimator (MBDE) described in
Chandra and Chambers (2009). This is based on the same
linear mixed model (12) as the EBLUP, with the MBDE
predictor defined as

$$\hat m_i^{MBDE} = \sum_{j \in s} w_{ij}^{MBDE} y_j = (w_i^{MBDE})^T y_s, \qquad (18)$$

where

$$w_{ij}^{MBDE} = \frac{I(j \in s_i) w_j^{EBLUP}}{\sum_{k \in s} I(k \in s_i) w_k^{EBLUP}}. \qquad (19)$$

Here $I(j \in s_i)$ is the indicator function for unit j to be in
the area i sample, and $w^{EBLUP} = (w_j^{EBLUP})$ is the vector of
weights that defines the EBLUP for the population total of
the $y_j$ under (12), i.e.,

$$w^{EBLUP} = 1_n + \{ \hat H^T X_r^T + (I_n - \hat H^T X_s^T) \hat\Sigma_{ss}^{-1} \hat\Sigma_{sr} \} 1_{N-n},$$

where $1_n$ ($1_{N-n}$) denotes the unit vector of size n (N − n)
and $\hat H$ was defined in section 2.2.1. In this case pseudo-linearisation
based estimation of the area-specific MSE of
the MBDE is carried out using (8), with weights defined by
(19). Note that the estimated expected values used in (8)
when applied to the MBDE are the same as the unshrunken
estimates (14) used with the EBLUP, reflecting the fact that
both the MBDE and the EBLUP are based on the same
linear mixed model (12). However, the MBDE weights (19)
are not locally calibrated, and so the squared bias term in (8)
cannot be ignored when estimating the MSE of this
predictor. Furthermore, since

$$w_{hi}^{MBDE} = \sum_{j \in s_h} w_{ij}^{MBDE} = 0$$

for $h \ne i$, we have $\delta_i = 0$ for the MBDE and so the bias
estimator (9) works well in this case.
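
The normalisation in (19) amounts to restricting a set of population-total weights to the area i sample and rescaling them to sum to one. A minimal sketch, with a hypothetical function name and with the EBLUP total weights assumed to be available as an array:

```python
import numpy as np

def mbde_weights(w_total, in_sample_area_i):
    """MBDE weights as in (19): total weights restricted to s_i and rescaled.

    w_total          : (n,) weights defining the EBLUP of the population total
    in_sample_area_i : (n,) boolean flags for sample units in area i
    """
    w_total = np.asarray(w_total, dtype=float)
    flag = np.asarray(in_sample_area_i, dtype=bool)
    w = np.where(flag, w_total, 0.0)   # zero weight outside the area-i sample
    return w / w.sum()
```

Because every weight outside $s_i$ is exactly zero, the MBDE is a direct (weighted mean) estimator, which is why $w_{hi}^{MBDE} = 0$ for $h \ne i$.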




2.2.4 MSE estimation for the M-quantile estimator 


The third estimator that we consider is based on the M-quantile
modelling approach described in Chambers and
Tzavidis (2006). This approach does not assume an underlying
linear mixed model, relying instead on characterising
the relationship between $y_j$ and $x_j$ in area i in terms of the
linear M-quantile model that best ‘fits’ the sample $y_j$
values from this area. That is, this approach replaces (12) by
a model of the form

$$y_i = X_i \beta(q_i) + e_i, \qquad (20)$$

where $\beta(q)$ denotes the coefficient vector of a linear model
for the regression M-quantile of order q for the population
values of Y and X, and $q_i$ denotes the M-quantile coefficient
of area i. Given an estimate $\hat q_i$ of $q_i$, an iteratively reweighted
least squares (IRLS) algorithm is used to calculate
an estimate

$$\hat\beta(\hat q_i) = \{ X_s^T W_s(\hat q_i) X_s \}^{-1} X_s^T W_s(\hat q_i) y_s \qquad (21)$$

of $\beta(q_i)$ in (20), and a non-sample value of $y_j$ in area i is
then predicted by $\hat y_j = x_j^T \hat\beta(\hat q_i)$. Here $W_s(\hat q_i)$ is the
diagonal matrix of final weights used in the IRLS algorithm.

Tzavidis, Marchetti and Chambers (2010) note that the value
of the M-quantile estimator suggested in Chambers and
Tzavidis (2006) can be interpreted as the expected value of
Y in area i with respect to a biased estimator of the
distribution function of this variable in the area. They
therefore develop an improved M-quantile estimator, replacing
this biased distribution function estimator by the
Chambers and Dunstan (1986) distribution function estimator
under the area-specific model (1). This corresponds to
predicting $m_i$ by

$$\hat m_i^{MQ} = (w_i^{MQ})^T y_s, \qquad (22)$$

where

$$w_i^{MQ} = n_i^{-1} \Delta_{is} + (1 - N_i^{-1} n_i) W_s(\hat q_i) X_s \{ X_s^T W_s(\hat q_i) X_s \}^{-1} (\bar x_{ir} - \bar x_{is}).$$

Here $\bar x_{is}$ and $\bar x_{ir}$ are the vectors of sample and non-sample
means of the $x_j$ in area i. It is not difficult to show that the
weights in (22) are locally calibrated. Furthermore,
if we then put $\hat\mu_j = x_j^T \hat\beta(\hat q_i)$, where $\hat\beta(\hat q_i)$ is defined by
(21), it is easy to see that (9) is zero and so the area-specific
MSE of the bias-corrected M-quantile estimator (22) can be
estimated using just the estimated prediction variance
component (7). Since the constant $\lambda_j$ in (7) is typically very
close to one under M-quantile estimation, we set it equal to
this value whenever we compute values of (7) that relate to
small area estimation (SAE) under the M-quantile modelling
approach.
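
The local calibration property of the weights in (22) can be checked numerically. The sketch below is illustrative only: the function name and arguments are hypothetical, and the diagonal of the final IRLS weight matrix is assumed to be supplied as a vector rather than computed by an M-quantile fit.

```python
import numpy as np

def mquantile_cd_weights(X_s, irls_weights, in_area_i, xbar_r, N_i):
    """Weights of the bias-corrected M-quantile predictor, following (22).

    X_s          : (n, p) sample auxiliary values
    irls_weights : (n,) diagonal of W_s(q_i), assumed given
    in_area_i    : (n,) boolean flags for sample units in area i
    xbar_r       : (p,) mean of x over the non-sampled units of area i
    N_i          : population size of area i
    """
    X_s = np.asarray(X_s, dtype=float)
    flag = np.asarray(in_area_i, dtype=bool)
    n_i = int(flag.sum())
    xbar_s = X_s[flag].mean(axis=0)
    WX = np.asarray(irls_weights, dtype=float)[:, None] * X_s
    # W X (X^T W X)^{-1} (xbar_r - xbar_s), via a linear solve.
    adjust = WX @ np.linalg.solve(X_s.T @ WX, np.asarray(xbar_r) - xbar_s)
    return flag / n_i + (1.0 - n_i / N_i) * adjust
```

Multiplying these weights into `X_s` returns $f_i \bar x_{is} + (1 - f_i) \bar x_{ir} = \bar x_i$ with $f_i = n_i / N_i$, whatever positive IRLS weights are used, which is the calibration property referred to in the text.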

As we have already done with the EBLUP, we note that
use of (7) implicitly treats the weights defining (22) as fixed,
which is actually not the case since the matrix $W_s(\hat q_i)$ is a
function of the sample values of Y. An immediate
consequence is that pseudo-linearisation based estimation of 
the MSE of the M-quantile predictor via (7) is a first order 
approximation to the true MSE of this estimator. Never- 
theless, since accounting for weight variability in the defini- 
tion of the M-quantile estimator considerably complicates 
estimation of its MSE - see Street, Carroll and Ruppert 
(1988) for an examination of this issue in the context of 
‘standard’ M-estimation of regression coefficients - it is of 
interest to see how the relatively simple estimator (7) 
performs when used to estimate this MSE. 


2.3. MSE estimation for the pseudo-linear synthetic 
EBLUP 


In many SAE applications there are areas that contain no
sample, and hence synthetic estimation is used. Although
such estimators do not fit into the class of pseudo-linear
estimators considered in this paper, the ideas behind the
conditional MSE estimator (8) can be applied here as well.
To see this, assume that these areas are numbered last, i.e., if
D* areas have non-zero sample then $n_h > 0$ for $h \le D^*$
and $n_h = 0$ for $h > D^*$. For $i > D^*$ the ‘synthetic
EBLUP’ for $m_i$ is

$$\hat m_i^{SYN-EBLUP} = \bar x_i^T \hat\beta^{EBLUE} = (w_i^{SYN-EBLUP})^T y_s = \sum_{h=1}^{D^*} \sum_{j \in s_h} w_{ij}^{SYN-EBLUP} y_j, \qquad (23)$$

where

$$w_i^{SYN-EBLUP} = (w_{ij}^{SYN-EBLUP}) = \hat H^T \bar x_i.$$

Clearly (23) is a pseudo-linear estimator, and so we can use
(7) to estimate its prediction variance, observing that since
$n_i = 0$, $a_{ij} = N_i w_{ij}^{SYN-EBLUP}$ and so (7) becomes

$$\hat V(\hat m_i^{SYN-EBLUP}) = N_i^{-2} \sum_{j \in s} \{ N_i^2 (w_{ij}^{SYN-EBLUP})^2 + N_i n^{-1} \} \lambda_j^{-1} (y_j - \hat\mu_j)^2. \qquad (24)$$

Unfortunately, since there is no sample in area i, we cannot
use (9) to estimate the area-specific bias (2) of $\hat m_i^{SYN-EBLUP}$.
However, under the linear mixed model (12), this bias has
expected value

$$E(\hat m_i^{SYN-EBLUP} - m_i) = \sum_{h=1}^{D^*} \sum_{j \in s_h} w_{ij}^{SYN-EBLUP} (x_j^T \beta + z_j^T u_h) - \bar x_i^T \beta - \bar z_i^T u_i.$$

The conditional expectation of the square of this expected
bias, given the area effects $u_s = (u_h; h = 1, ..., D^*)$ for the
sampled areas, is

$$E\{ E^2(\hat m_i^{SYN-EBLUP} - m_i) | X, u_s \} = \Big[ \sum_{h=1}^{D^*} \sum_{j \in s_h} w_{ij}^{SYN-EBLUP} (x_j^T \beta + z_j^T u_h) - \bar x_i^T \beta \Big]^2 + \bar z_i^T \Omega \bar z_i,$$

which immediately suggests that for a non-sampled area i
we estimate the squared bias of the synthetic estimator
$\hat m_i^{SYN-EBLUP}$ by

$$\hat B^2(\hat m_i^{SYN-EBLUP}) = \Big[ \sum_{h=1}^{D^*} \sum_{j \in s_h} w_{ij}^{SYN-EBLUP} (x_j^T \hat\beta^{EBLUE} + z_j^T \tilde u_h) - \bar x_i^T \hat\beta^{EBLUE} \Big]^2 + \bar z_i^T \hat\Omega \bar z_i. \qquad (25)$$

Here $\tilde u_h$ is the ‘unshrunken’ estimated effect for sampled
area h (see (14)). Our proposed MSE estimator for
$\hat m_i^{SYN-EBLUP}$ is then the sum of (24) and (25). Note that,
unlike (8), this MSE estimator includes no information from
area i, and so is not an estimator of the area-specific MSE of
(23). In particular, its validity depends completely on the
mixed model (12) holding, and so it is not robust to
misspecification of this model.


3. Simulation studies of 
the proposed MSE estimator 


In this section we describe results from five simulation 
studies that aim at assessing the performance of the ap- 
proach to conditional MSE estimation described in the 
previous section. Three of these studies are model-based 
simulations, with population data generated from the linear 
mixed model (12). The remaining two are design-based 
simulations, with population data derived from two real 
survey datasets where linear SAE is of interest. 

Given our focus on bias-robustness, the main performance
indicator for an MSE estimator in all five studies is
its median relative bias, defined by

$$RB(\hat M) = median_i \Big\{ M_i^{-1} K^{-1} \sum_{k=1}^{K} (\hat M_{ik} - M_i) \Big\} \times 100.$$

Here the subscript i indexes the small areas and the subscript
k indexes the K Monte Carlo simulations, with $\hat M_{ik}$ denoting
the simulation k value of the MSE estimator in area i,
and $M_i$ denoting the actual (i.e., Monte Carlo) MSE in area
i. Since we would naturally prefer to use the more stable of
two approximately unbiased MSE estimators, we also
measured the stability of an MSE estimator by its median
percent relative root mean squared error,

$$RRMSE(\hat M) = median_i \Big\{ \sqrt{ K^{-1} \sum_{k=1}^{K} \Big( \frac{\hat M_{ik} - M_i}{M_i} \Big)^2 } \Big\} \times 100.$$

Although the purpose of this paper is not to compare
different methods of SAE, it is useful to relate MSE
estimation performance for a particular method of SAE to
the actual estimation performance of this method. We
therefore provide two measures of the relative performance
of the SAE methods that were used in our simulations.
These are the median percent relative bias

$$RB(\hat m) = median_i \Big\{ \bar m_i^{-1} K^{-1} \sum_{k=1}^{K} (\hat m_{ik} - m_{ik}) \Big\} \times 100$$

and the median percent relative root mean squared error

$$RRMSE(\hat m) = median_i \Big\{ \sqrt{ K^{-1} \sum_{k=1}^{K} \Big( \frac{\hat m_{ik} - m_{ik}}{m_{ik}} \Big)^2 } \Big\} \times 100$$

of the estimates $\hat m_{ik}$ generated by an estimation method.
Note that $\bar m_i = K^{-1} \sum_{k=1}^{K} m_{ik}$ here.
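
Given arrays of simulation output, the two performance measures for an MSE estimator can be computed in a few lines. This is a sketch under the reading of the definitions given above (in particular, that the relative root mean squared error averages squared relative errors over simulations before taking the median over areas); the function name and array layout are assumptions made for the example.

```python
import numpy as np

def mse_estimator_performance(M_hat, M_true):
    """Median percent relative bias and RRMSE of an MSE estimator.

    M_hat  : (K, D) array, estimated MSE for simulation k and area i
    M_true : (D,) array, Monte Carlo MSE for each area
    """
    M_hat = np.asarray(M_hat, dtype=float)
    M_true = np.asarray(M_true, dtype=float)
    rel_err = (M_hat - M_true) / M_true              # broadcast over simulations
    rb = np.median(rel_err.mean(axis=0)) * 100.0     # median over areas
    rrmse = np.median(np.sqrt((rel_err ** 2).mean(axis=0))) * 100.0
    return rb, rrmse
```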


3.1 Model-based simulations 


The first model-based simulation study was based on
population data generated under the mixed model (12) with
Gaussian random effects. It used a population size of N =
15,000, with D = 30 small areas. Population sizes in the
small areas were uniformly distributed over the interval
[443, 542] and kept fixed over simulations. At each simulation,
population values for Y were generated under the
random intercepts model $y_j = 500 + 1.5 x_j + u_i + e_j$,
with $x_j$ drawn from a chi-squared distribution with 20
degrees of freedom. The area effects $u_i$ and individual
effects $e_j$ were independently drawn from $N(0, \sigma_u^2)$ and
$N(0, \sigma^2)$ distributions respectively, with the values of $\sigma_u^2$
and $\sigma^2$ shown in rows SIM1-A and SIM1-B of Table 1. A
sample of size n = 600 was selected from each simulated
population, with area sample sizes proportional to the fixed
area populations, resulting in a median area sample size of
$n_i = 20$. Sampling was via stratified random sampling, with
the strata defined by the small areas. A total of K = 1,000
simulations were carried out.
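
One replicate of this design could be generated along the following lines. The sketch is illustrative and not the authors' simulation code: the function names are hypothetical, the area population sizes are passed in by the caller rather than drawn from [443, 542], and the variance values are left as arguments.

```python
import numpy as np

def simulate_population(rng, area_sizes, sigma2_u, sigma2_e):
    """Generate y_j = 500 + 1.5 x_j + u_i + e_j with x_j ~ chi-squared(20)."""
    area_sizes = np.asarray(area_sizes)
    area = np.repeat(np.arange(area_sizes.size), area_sizes)
    x = rng.chisquare(df=20, size=area.size)
    u = rng.normal(0.0, np.sqrt(sigma2_u), size=area_sizes.size)  # area effects
    e = rng.normal(0.0, np.sqrt(sigma2_e), size=area.size)        # unit effects
    y = 500.0 + 1.5 * x + u[area] + e
    return area, x, y

def stratified_sample(rng, area, n_by_area):
    """Simple random sampling without replacement within each area."""
    picks = [rng.choice(np.flatnonzero(area == a), size=n_a, replace=False)
             for a, n_a in enumerate(n_by_area)]
    return np.concatenate(picks)
```

For example, with 30 equally sized areas of 500 units and 20 sampled units per area, the total population and sample sizes match the N = 15,000 and n = 600 quoted above.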

Conditions for the second model-based simulation study 
were the same as in the first, with the exception that the area 
level random effects and the individual level random effects 
were independently drawn from mean corrected chi-square 
distributions respectively. The corresponding values of the 
area level and individual level variances are shown in rows 
SIM2-A and SIM2-B in Table 1. Finally, in the third model- 
based simulation study conditions were kept the same as in 
SIM1-A and SIM1-B for areas 1-25, but in areas 26-30 the 
area effects were independently drawn from a normal 
distribution with a larger variance. We refer to this as a 
Mixture in Table 1, with variances for areas 1-25 shown in 
rows SIM3-A and SIM3-B, and variances for areas 26-30 
shown in rows SIM3-A* and SIM3-B*. Our objective in this
third simulation was to investigate the behaviour of the
different methods of MSE estimation for ‘outlier’ areas, and
so we show values relating to areas 1-25 and 26-30
separately in Tables 2 and 4. We also replicated all three
scenarios above using a reduced overall sample size of
n = 150 (with median area sample size $n_i = 5$). These
additional simulations allowed us to investigate the effect of 
reduced sample sizes on the performance of the MSE 
estimators. 


Table 1
Parameter values used in model-based simulations

Type                    Simulation   $\sigma_u^2$   $\sigma^2$   $\rho = \sigma_u^2 (\sigma_u^2 + \sigma^2)^{-1}$
Gaussian                SIM1-A       [missing]      94.09        0.1
                        SIM1-B       [missing]      94.09        0.3
Chi-square              SIM2-A       [missing]      10.0         0.1667
                        SIM2-B       4.0            10.0         0.2857
Mixture (areas 1-25)    SIM3-A       [missing]      94.09        0.10
                        SIM3-B       [missing]      94.09        0.30
Mixture (areas 26-30)   SIM3-A*      [missing]      94.09        0.7051
                        SIM3-B*      [missing]      94.09        0.7051
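
The three parameter columns of Table 1 are linked by the definition of $\rho$, so any one of them follows from the other two. As a small arithmetic sketch (the function name is hypothetical), solving $\rho = \sigma_u^2 (\sigma_u^2 + \sigma^2)^{-1}$ for the area level variance gives:

```python
def area_variance_from_rho(rho, sigma2_e):
    """Solve rho = s2u / (s2u + s2e) for s2u, the area level variance."""
    if not 0.0 <= rho < 1.0:
        raise ValueError("rho must lie in [0, 1)")
    return rho * sigma2_e / (1.0 - rho)
```

Applied to the SIM2-B row, where all three values are legible, `area_variance_from_rho(0.2857, 10.0)` agrees with the tabulated value of 4.0 to rounding.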
Table 2
Median relative biases RB($\hat m$) and median relative root mean squared errors RRMSE($\hat m$) of estimators of small area means in model-based simulations

Weighting Method    SIM1-A   SIM1-B   SIM2-A   SIM2-B   SIM3-A   SIM3-B   SIM3-A*   SIM3-B*

RB($\hat m$), median $n_i = 20$
Regression           0.005    0.005    0.000    0.000    0.004    0.004    0.006     0.006
EBLUP, (13)          0.005    0.006    0.004   -0.002    0.004    0.005    0.006     0.005
MBDE, (18)           0.006    0.006    0.005   -0.008    0.007    0.007    0.001     0.001
M-quantile, (22)     0.009    0.008   -0.002    0.002    0.015    0.015   -0.013    -0.013

RRMSE($\hat m$), median $n_i = 20$
Regression           0.40     0.40     0.13     0.13     0.40     0.40     0.41      0.41
EBLUP, (13)          0.35     0.38     0.12     0.13     0.37     0.38     0.45      0.42
MBDE, (18)           0.55     0.55     0.41     0.43     0.56     0.56     0.55      0.55
M-quantile, (22)     0.41     0.41     0.13     0.13     0.41     0.41     0.36      0.36

RB($\hat m$), median $n_i = 5$
Regression          -0.002   -0.003   -0.001    0.002   -0.003   -0.004    0.011     0.011
EBLUP, (13)          0.001    0.005   -0.002    0.003    0.002   -0.001    0.008     0.011
MBDE, (18)          -0.002   -0.002   -0.005    0.004   -0.001   -0.002   -0.002    -0.002
M-quantile, (22)    -0.001   -0.001   -0.001    0.001   -0.003   -0.003    0.014     0.014

RRMSE($\hat m$), median $n_i = 5$
Regression           0.81     0.81     0.26     0.26     0.82     0.82     0.80      0.80
EBLUP, (13)          0.53     0.69     0.19     0.22     0.61     0.71     1.00      0.87
MBDE, (18)           [illegible]  [illegible]  0.83  0.83  [illegible]  1.13  [illegible]  [illegible]
M-quantile, (22)     0.81     0.81     0.26     0.26     0.81     0.81     0.80      0.80


Table 2 shows the median relative bias RB($\hat m$) and median
relative root mean squared error RRMSE($\hat m$) of the SAE
methods investigated in our simulations for the two sample
sizes (n = 600 and 150). These are the synthetic regression
estimator (see Rao 2003, page 136), the EBLUP with 
weights defined by (13), the MBDE with weights defined 
by (18) and the M-quantile estimator defined by the weights 
(22). The differences between the various SAE estimators in 
Table 2 are essentially as one would expect. Bias is not 
really an issue (to be expected given the population data 
follow a linear model in all cases), while for Simulation 
scenarios 1 and 2 the indirect estimator (EBLUP) is the 
most efficient in terms of RRMSE. The M-quantile 
estimator is the best performer for SIM3-A* and SIM3-B*
with $n_i = 20$ but its difference from the regression
synthetic estimator reduces for the scenario with the smaller
area-specific sample sizes. Note that in this case the M-quantile
weights (22) are based on an outlier-robust estimate
of the M-quantile coefficient $q_i$ for area i, defined by the
median (rather than the mean) of the M-quantile coefficients
of sampled units in this area. Further, as the sample sizes
decrease, the RRMSEs of all estimators increase, but their 
relative performances remain the same. Under normality the 
EBLUP is better than the M-quantile estimator but the 
differences between these two estimators become smaller as 
we move away from normality, with the M-quantile esti- 
mator more efficient in the mixture model scenarios. 

Table 3 sets out the various MSE estimators investigated 
in our simulations that are based on the approach proposed in 
this paper. These are collectively referred to as “conditional” 


MSE estimators below. In Table 4 we show the perfor- 
mances of MSE estimators for the small area estimators 
considered in Table 2. Note that in addition to the 
conditional MSE estimators, we provide results for three 
other MSE estimators for the EBLUP, with PRO denoting 
the estimator suggested by Prasad and Rao (1990), see Rao 
(2003, section 6.2.6). It is noteworthy that PRO is not an 
estimator of the area-specific MSE of the EBLUP, but of its 
MSE under the mixed linear model (12), i.e., averaged over 
possible realisations of the area effect. In contrast, the MSE 
estimators PR1 and PR2 in Table 4 are the area-specific 
versions of PRO suggested in Rao (2003, section 6.3.2 
expressions 6.3.15 and 6.3.16 respectively). Finally, we note 
that the MSE estimator of the synthetic regression estimator 
that we used in our simulations is its variance estimator 
based on a fixed effects population regression model. We 
denote it by VReg. 

The results set out in Table 4 focus on the median relative biases
RB($\hat M$) and median relative root mean squared errors
RRMSE($\hat M$) of the various MSE estimators. Not surprisingly,
given that all its underlying assumptions are met, the PRO
estimator and its area-specific alternatives, PR1 and PR2,
perform very well in both normal scenarios (SIM1-A and
SIM1-B) and both chi-squared scenarios (SIM2-A and
SIM2-B), with virtually no bias when $n_i = 20$ and only small bias
when within area sample sizes are very small. For the MSE
estimator of the synthetic regression estimator, on the other 
hand, we see substantial relative bias under all simulation 
scenarios.

Table 3
Definitions of conditional MSE estimators for different weighting methods

Weighting Method        Definition of \hat{\mu}_j, j \in i      MSE Estimator
EBLUP (13)              (14)                                     (8)
MBDE (18)               (14)                                     (8)
M-quantile (22)         x_j^T \hat{\beta}(\hat{q}_i)             (7) with \lambda_j = 1
Synthetic EBLUP (23)    (14)                                     (24) + (25)

Table 4
Median relative biases RB(M) and median relative root mean squared errors RRMSE(M) for MSE estimators in model-based simulations

Weighting         MSE          Simulation
Method            Estimator          SIM1-A      SIM1-B      SIM2-A      SIM2-B      SIM3-A      SIM3-B     SIM3-A*     SIM3-B*

RB(M), median n_i = 20
Regression        VReg                 7.59       21.82       11.81       20.78       23.66       34.27       23.97       34.64
EBLUP, (13)       PR0                 -0.83       -0.72        0.56        1.16        3.44        0.71      -15.65       -6.51
                  PR1                 -0.97       -0.72        0.64        1.08        2.94        0.56      -13.70       -5.81
                  PR2                 -0.92       -0.72        0.64        1.16        3.20        0.61      -14.65       -6.19
                  Conditional          3.89       -0.89        3.06        0.93       -0.05       -0.54       -2.56       -1.59
MBDE, (18)        Conditional         -0.81       -0.80       -0.06       -0.42       -0.75       -0.75       -0.98       -0.98
M-quantile, (22)  Conditional         -3.10       -1.66       -0.09       -1.90 [illegible]       -3.17       11.26       11.04

RRMSE(M), median n_i = 20
Regression        VReg                   18 [illegible]          30          58 [illegible]          85          60          86
EBLUP, (13)       PR0                    12           7          15          10 [illegible]           7          29          14
                  PR1                    14           7 [illegible]          11          10           7          27          13
                  PR2                    12 [illegible]          16          10          11 [illegible]          28          13
                  Conditional            62          31          70          49          31          30          42          32
MBDE, (18)        Conditional            70          70         126         128          71 [illegible]          67          67
M-quantile, (22)  Conditional            32          34          49          48 [illegible]          32          48          48

RB(M), median n_i = 5
Regression        VReg         [illegible] [illegible]       10.35 [illegible]       20.92       30.91       22.93       33.00
EBLUP, (13)       PR0          [illegible]       -0.20        2.42        1.19       12.79        3.86      -30.64      -15.92
                  PR1                  3.04       -0.50        2.13        1.00       10.84        3.10      -25.77      -13.62
                  PR2                  3.16       -0.31        2.31 [illegible]       11.81        3.48      -28.16      -14.77
                  Conditional         37.52        4.38       24.11        8.93        8.18        1.50       -0.66       -0.68
MBDE, (18)        Conditional         -0.24       -0.21        0.02       -0.09       -0.62       -0.33        1.29        1.24
M-quantile, (22)  Conditional         -7.60       -6.17 [illegible]        5.00       -5.95       -5.60        5.89        3.60

RRMSE(M), median n_i = 5
Regression        VReg                  117          46          33          51          54          78 [illegible]          83
EBLUP, (13)       PR0                    31          14          33 [illegible]          36          16          58          31
                  PR1                    48          18          44          28          34          16          48          29
                  PR2                    36           5          36          24          34 [illegible]          50          29
                  Conditional           234          81         193         121          86          66          86          70
MBDE, (18)        Conditional            79          79         133         129          79          79          83          83
M-quantile, (22)  Conditional            62          63          90          97          63          63         122         102


The conditional MSE estimator for the EBLUP shows
positive bias under both the normal (SIM1-A) and
chi-squared (SIM2-A) scenarios, particularly for moderate
intra-cluster correlation (3.89% and 37.52% for the normal
scenario with 20 and 5 units in each area respectively and
3.06% and 24.11% for the chi-squared scenario with 20 and
5 units in each area respectively). This bias increases with
decreasing sample size. However, things change when we
examine the results for the outlier components of the
mixture model scenarios (SIM3-A* and SIM3-B*). Here we
see a substantial negative bias for all three versions of PR
(ranging from -30.64% to -5.81% depending on the area
sample sizes). In comparison, the conditional MSE
estimator for the EBLUP now shows a smaller negative bias
(-2.56% and -0.66%) while the same MSE estimator applied
to the M-quantile estimator shows an upward bias. The
conditional MSE estimator for the MBDE is essentially
unbiased. Given that as far as MSE estimation is concerned,
positive bias is preferable to negative bias, it seems clear
that the proposed conditional MSE estimator is better able to
handle this outlier situation. Figure 1 graphically illustrates
this point for sample size n = 600. Here we show the
area-specific RMSEs and the average (over the simulations) of
the estimated RMSEs in each of the 30 areas for the mixture
simulations SIM3-A and SIM3-A*, with the vertical line
delineating the five ‘outlier’ areas. In the top panel of this
plot we see that the PR0 estimator is unable to detect the
step increase in the MSE of the EBLUP for these ‘outlier’
areas, being biased slightly high in the ‘well-behaved’ areas
and then biased rather low in the ‘outlier’ areas. In contrast,
the conditional MSE estimator for the EBLUP and the
MBDE tracks the area-specific RMSEs rather well, while
the same MSE estimator based on M-quantile weights tends
to be biased low in the ‘well-behaved’ areas, and biased
high in the ‘outlier’ areas, which can be argued as being
perhaps a rather better outcome than that recorded by the
PR0 estimator in this simulation. It should be noted here that
in certain circumstances an assumed model can be revised
after outlier detection. However, this requires a sufficiently
large number of detected outliers to permit their separate
modelling. This is unlikely to happen in practice. Also,
particular care must be taken with extrapolation of these
results to the case of very small area sample sizes because of
the instability that the conditional MSE estimator can
exhibit in this case.

Table 4 also shows the relative RMSEs of the different
MSE estimators across the three types of model-based
simulation. Here we see that all three versions of the PR
estimator of the MSE of the EBLUP are more stable than
the conditional MSE estimator of the EBLUP (12% for PR
vs. 62% for the conditional MSE estimator for SIM1-A with
n_i = 20 and 31% for PR vs. 234% for the conditional MSE
estimator for SIM1-A with n_i = 5). These differences decrease
under scenarios SIM3-A* and SIM3-B*, although the
PR MSE estimator remains more stable (13% for PR vs.
32% for the conditional MSE estimator for SIM3-B* with
n_i = 20 and 29% for the PR MSE estimator vs. 70% for
the conditional MSE estimator for SIM3-B* with n_i = 5).
The same is true for the conditional MSE estimators of the
MBDE and the M-quantile estimators. Essentially, given
sample data that follow a mixed linear model, the PR
estimator of MSE is very stable, while the conditional MSE
estimator is more variable.
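
The two summary measures reported in Table 4 can be illustrated with a short sketch. The definitions used here are assumptions, since this section does not restate them: for each area, the relative bias is taken as the average estimated MSE minus the true MSE, divided by the true MSE, and the relative RMSE as the root mean squared error of the estimated MSE divided by the true MSE, both in percent, with the median then taken over areas. The function name and the toy inputs are hypothetical.

```python
import math
import statistics


def mse_estimator_performance(est_mse, true_mse):
    """Median relative bias and median relative RMSE (both in percent).

    est_mse[k][i] is the estimated MSE for area i in simulation k and
    true_mse[i] is the 'true' MSE for area i (assumed known from the
    simulation). These definitions are assumptions made for illustration.
    """
    rel_bias, rel_rmse = [], []
    for i, target in enumerate(true_mse):
        draws = [row[i] for row in est_mse]
        mean_est = statistics.fmean(draws)
        rel_bias.append(100.0 * (mean_est - target) / target)
        mse_of_est = statistics.fmean((d - target) ** 2 for d in draws)
        rel_rmse.append(100.0 * math.sqrt(mse_of_est) / target)
    return statistics.median(rel_bias), statistics.median(rel_rmse)


# Toy input: two simulations, two areas, true MSE equal to 2 in both areas.
toy_estimates = [[1.0, 2.0], [3.0, 2.0]]
toy_truth = [2.0, 2.0]
rb, rrmse = mse_estimator_performance(toy_estimates, toy_truth)
```

Under these definitions an estimator can be unbiased on average in every area and still have a large relative RMSE, which is the pattern the conditional MSE estimator shows for small area sample sizes.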

In summary, although all methods of MSE estimation
that we evaluated exhibited some bias for very small area
sample sizes, our model-based simulation results provide
evidence that for larger area sample sizes the conditional
MSE estimation method (8) is bias robust when applied to
the three pseudo-linear small area estimators EBLUP,
MBDE and M-quantile. For very small area sample sizes its
bias robustness is less evident. As one might expect, the
model dependent ‘area-averaged’ MSE estimator PR0 for
the EBLUP exhibits bias under model failure. The fact that
we observed rather similar behaviour for the area-specific
versions PR1 and PR2 of this MSE estimator indicates that
‘area specific’ does not necessarily mean ‘bias robust’. In
particular, the fact that PR1 and PR2 behave very similarly
to PR0 may be because the area-specific components of
PR1 and PR2 are of lower order and all three MSE
estimators have the same leading term, which is not
area-specific. Our results also show that the conditional MSE
estimator (8) is much more variable than the model
dependent PR MSE estimator, even for moderate area
sample sizes.


3.2 Design-based simulations 


What happens when, as in real life, we cannot be 
confident that our data follow a linear mixed model? In 
order to investigate this situation, we report results from two 
design-based simulation studies, both based on realistic 
populations, where a linear model assumption is essentially 
an approximation. The first involved a sample of 3,591
households spread across D = 36 districts of Albania that
participated in the 2002 Albanian Living Standards
Measurement Study. This sample was bootstrapped to create a
realistic population of N = 724,782 households by
re-sampling with replacement with probability proportional to
a household’s sample weight. A total of K = 1,000
independent stratified random samples were then drawn from
this bootstrap population, with total sample size equal to that
of the original sample and with districts defining the strata.
Sample sizes within districts were the same as in the original
sample, and varied between 8 and 688 (with median district
sample size equal to 56). The Y variable of interest was
household per capita consumption expenditure (HCE) and X
was defined by three zero-one variables (ownership of
television, parabolic antenna and land). The aim was to estimate
the average value of HCE for each district. In the original
2002 survey, the linear relationship between HCE and the 
three variables making up X was rather weak, with very low 
predictive power. In particular, only ownership of land was 
significantly related to HCE at the five percent level. This fit 
was considerably improved by extending the linear model to 
include random intercepts, defined by independent district 
effects. These explained approximately 10 per cent of the 
residual variation in this model.

Figure 1 Area specific values of true RMSE (solid line) and average estimated RMSE (dashed line) obtained in the mixture-based
simulations SIM3-A and SIM3-A*. Values for the PR0 estimator are indicated by Δ while those for the conditional estimator
are indicated by o. Plots show results for the EBLUP (top), MBDE (centre) and M-quantile (bottom) estimators. Vertical line
separates areas 26-30 with ‘outlier’ effects from ‘well-behaved’ areas 1-25. Total sample size is 600 with area-specific sample
sizes equal to 20

The second design-based simulation study was based on
an ‘outlier free’ version of the population of Australian 
broadacre farms that was used in the simulation studies 
reported in Chambers and Tzavidis (2006) and Chandra and 
Chambers (2009). In particular, this population was defined 
by bootstrapping a sub-sample of 1,579 ‘non-outlier’ farms 
that participated in the Australian Agricultural and Grazing 
Industries Survey (AAGIS) to create a population of N = 
78,072 farms by re-sampling from the original AAGIS 
sample with probability proportional to a farm’s sample 
weight. The small areas of interest in this case were the
D = 28 broadacre farming regions represented in this
sub-sample. The design-based simulation was carried out by
selecting K = 1,000 independent stratified random samples 
from this bootstrap population, with strata defined by the 
regions and with stratum sample sizes defined by those in 
the original AAGIS sample. These sample sizes vary from 6 
to 117, with a median region sample size of 53. Here Y is 
Total Cash Costs (TCC) associated with operation of the 
farm, and X is a vector that includes farm area (Area), 
effects for six post-strata defined by three climatic zones and 
two farm size bands as well as the interactions of these 
variables. In the original AAGIS sample the relationship
between TCC and Area varies significantly between the six
post-strata, with an overall R-squared value of approximately
0.46 after the deletion of two outliers. The fixed effects in
the prediction model were therefore specified as corre- 
sponding to a separate linear fit of TCC in terms of Area in 
each post-stratum. Random effects (necessary for computa- 
tion of the EBLUP and the MBDE, but not the M-quantile 
predictor) were defined as independent regional effects (i.e., 
a random intercepts specification) on the basis that in the 
original AAGIS sample the between region variance 
component explains about 3 per cent of the total residual 
variability with the two outliers removed. The aim was to 
estimate the regional averages of TCC.

Tables 5 and 6 show the median relative biases and the
median relative RMSEs of different estimators and
corresponding estimators of the MSEs of these estimators based
on the K = 1,000 independent stratified samples taken from
the Albanian and AAGIS populations respectively. It is
noteworthy that in spite of the fact that the linear mixed
models fitted to both the Albanian and AAGIS data appear
reasonable, the adoption of SAE methods based
on them does not lead to substantial improvements in
efficiency given the original regional sample sizes for these
surveys. On the other hand, the M-quantile estimator, which
is not based on a random effects specification, works well
both in terms of bias and MSE for the AAGIS population in
this case (Table 6, median n_i = 53), while the EBLUP,
although the best performer in terms of MSE for the Albanian
population (Table 5, median n_i = 56), also records the
highest biases (albeit still small, with the largest less than
2%) for both populations given the original area sample
sizes. The survey regression estimator performs well,
although for both populations there are indirect estimators that
perform somewhat better. Design-based simulations based
on the Albanian and AAGIS populations were also carried
out using smaller area sample sizes than in the original
surveys. In particular, the overall sample size was reduced
for the Albanian population to n = 291 (with a median
district sample size of 9). Similarly, the overall sample size
was reduced for the AAGIS population to n = 233 (with a
median regional sample size of 8). As expected, the RMSE
of the point estimators increases as the area sample sizes
decrease. Overall, the EBLUP improves its RMSE
performance relative to all other estimators given these smaller
sample sizes. However, since the realism of these reduced
sample size designs is somewhat questionable, we do not
place too much emphasis on results derived from them,
noting only that they are useful for assessing the
performance of MSE estimators with realistic data and with very
small sample sizes.
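
The design shared by the two design-based studies (a bootstrap population built by weight-proportional resampling, followed by repeated stratified sampling with fixed stratum sample sizes) can be sketched as follows. This is a minimal illustration, not the authors' code: the function names and the toy data are hypothetical, and sampling within strata is assumed here to be simple random sampling without replacement, which the text does not state explicitly.

```python
import random


def bootstrap_population(sample_units, sample_weights, pop_size, rng):
    """Resample the original sample with replacement, with selection
    probability proportional to each unit's sample weight."""
    return rng.choices(sample_units, weights=sample_weights, k=pop_size)


def stratified_sample(population, stratum_of, stratum_sizes, rng):
    """Draw an independent simple random sample (without replacement,
    an assumption) of the required size inside every stratum."""
    by_stratum = {}
    for unit in population:
        by_stratum.setdefault(stratum_of(unit), []).append(unit)
    drawn = []
    for stratum, n_h in stratum_sizes.items():
        drawn.extend(rng.sample(by_stratum[stratum], n_h))
    return drawn


# Toy data: units are (district, value) pairs with made-up sample weights.
rng = random.Random(1)
units = [("A", 1.0), ("A", 2.0), ("A", 3.0), ("B", 4.0), ("B", 5.0), ("B", 6.0)]
weights = [10, 20, 30, 15, 25, 20]
population = bootstrap_population(units, weights, 200, rng)

# K repeated samples with the stratum sample sizes held fixed.
K = 5
samples = [
    stratified_sample(population, lambda u: u[0], {"A": 3, "B": 2}, rng)
    for _ in range(K)
]
```

Holding the stratum sample sizes fixed across the K replicates mirrors the description above, where the sample sizes within districts or regions match those of the original survey.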


Table 5
Performances of estimators of regional means and their MSE estimators: Albanian household population

                               Median n_i = 56           Median n_i = 9
Weighting Method/Estimator       RB(m)    RRMSE(m)       RB(m)    RRMSE(m)
Regression                        0.04        6.25       -0.13       16.56
EBLUP, (13)                       0.42        5.90        1.62       12.42
MBDE, (18)                        0.03        6.14        0.04       16.92
M-quantile, (22)                  0.04        6.07       -0.05       16.60
Method/MSE                       RB(M)    RRMSE(M)       RB(M)    RRMSE(M)
Regression/VReg                   17.6          42 [illegible]          42
EBLUP/PR0                         14.6          44        10.5          50
EBLUP/PR1                         14.4          43         8.8          48
EBLUP/PR2                         14.5          43 [illegible]          48
EBLUP/Conditional                  0.1          24         7.7          99
MBDE/Conditional                  -0.8 [illegible]        -5.5          64
M-quantile/Conditional     [illegible]          27        -2.0          75

Table 6

Performances of estimators of regional means and their MSE estimators: AAGIS farm population

                               Median n_i = 53           Median n_i = 8
Weighting Method/Estimator       RB(m)    RRMSE(m)       RB(m)    RRMSE(m)
Regression                        0.03       13.36        0.08       29.83
EBLUP, (13)                       1.64       13.53        0.92       25.82
MBDE, (18)                       -0.73       14.26       -1.02 [illegible]
M-quantile, (22)                 -0.04       11.68       -0.15 [illegible]
Method/MSE                       RB(M)    RRMSE(M)       RB(M)    RRMSE(M)
Regression/VReg                   74.1         406        54.7         867
EBLUP/PR0                         22.4 [illegible] [illegible]         374
EBLUP/PR1                  [illegible]         137        19.0         367
EBLUP/PR2                         21.0         123 [illegible]         444
EBLUP/Conditional          [illegible]         132        17.8         255
MBDE/Conditional                  -0.5         181         0.9         318
M-quantile/Conditional            -0.7          69        -1.9         212


Focusing on the simulation results obtained using the
original regional sample sizes, we see that all three
PR-based MSE estimators for the EBLUP display a substantial
upward bias in both sets of design-based simulations, and
that their instability is larger than (Albanian population,
Table 5) or comparable to (AAGIS population, Table 6) that
of the conditional MSE estimators. For the Albanian
population all three versions of the conditional MSE
estimator are essentially unbiased, whereas for the AAGIS
population all three versions of the conditional MSE
estimator display small or moderate bias.

It is noteworthy that for the Albanian population (Table
5) the relative performances of the PR MSE estimators
improve with smaller samples. However, this is because the
conditional MSE estimators then become more unstable.
For these very small area samples the conditional MSE
estimator is less biased than the PR MSE estimator (7.7%
vs. 10.5%) but is also more unstable (RRMSE of conditional
MSE estimator is 99% vs. 50% for the PR MSE estimator).
This is, however, not the case for the AAGIS population
with median n_i = 8. In this case, the PR-based MSE
estimators perform badly, with the conditional MSE estimators
being both less biased and more stable.

The MSE estimator of the regression estimator exhibits
moderate or high bias for both populations and all
simulation scenarios. For the Albanian population it appears to be
competitive with the other MSE estimators in terms of
RRMSE but for the AAGIS population it is clearly less
stable than the other MSE estimators. Finally, the
conditional MSE estimator of the M-quantile estimator performs
well, with small relative bias and good stability, for all
simulation scenarios and both populations with the exception of
the Albanian population with median n_i = 9, where its
RRMSE is 75%.

An insight into the reasons for these differences in
behaviour can be obtained by examining the area specific
RMSE values displayed in Figure 2 for the Albanian
population and in Figure 3 for the AAGIS population. Note
that in both cases the sample sizes are those from the
original surveys. Thus, in Figure 2 we see that all three
conditional MSE estimators track the district-specific
design-based RMSEs of their respective estimators exceptionally
well. In contrast, the PR0 estimator does not seem to be able
to capture between district differences in the design-based
RMSE of the EBLUP. In Figure 3 we see that the
conditional estimator of the MSE of the M-quantile estimator
performs extremely well in all regions, with the
corresponding estimator of the MSE of the MBDE also
performing well in all regions except one (region 6) where it
substantially overestimates the design-based RMSE of this
predictor. This region is noteworthy because samples that
are unbalanced with respect to Area within the region lead
to negative weights under the assumed linear mixed model.
The picture becomes more complex when one considers the
region-specific RMSE estimation performance of the
EBLUP in Figure 3. Here we see that the conditional
estimator of the MSE of the EBLUP clearly tracks the
region-specific design-based RMSE of this predictor better
than the PR0 MSE estimator. With the exception of region 6
(where sample balance is a problem), there seems to be little
regional variation in the value of the PR0 estimator of the
RMSE of the EBLUP, indicating a serious bias problem.

As noted earlier, it is not uncommon to want to produce
an estimate for a small area where there is no sample. In
such cases, one has to rely completely on the correctness of
the model specification. In Table 7 we illustrate the
importance of this assumption by contrasting the estimation
and MSE estimation performances of the EBLUP for
sampled areas with that of the Synthetic EBLUP for areas
where no sample data are available. Two situations are
shown. The first is a modification of the model-based
SIM1-A simulation with a small average sample size and with five
zero sample areas. The second is a similar small sample
modification of the design-based simulation based on the
AAGIS population, with four zero sample areas. It is clear
that when the model underpinning the EBLUP actually
holds (i.e., SIM1-A), estimation and MSE estimation (either
based on PR0, or on the conditional alternative) work well.
The problem is that when there is some doubt about how
well this model holds (as in the AAGIS population), then
the EBLUP can fail, and our estimator of its MSE can also
fail to identify this problem. This is nicely illustrated by the
results for the AAGIS population in Table 7 where we see
that both the PR0 and conditional MSE estimators for the
Synthetic EBLUP completely fail to identify the large
positive bias of the Synthetic EBLUP and so end up with a
large downward bias.

Figure 2 District level values of true design-based RMSE (solid line) and average estimated RMSE (dashed line) obtained in the design-
based simulations using the Albanian household population. Districts are ordered in terms of increasing population size.
Values for the PR0 estimator are indicated by Δ while those for the conditional estimator are indicated by o. Plots show results
for the EBLUP (top), MBDE (centre) and M-quantile (bottom) estimators

Figure 3 Regional values of true design-based RMSE (solid line) and average estimated RMSE (dashed line) obtained in the design-
based simulations using the AAGIS farm population. Regions are ordered in terms of increasing population size. Values for the
PR0 estimator are indicated by Δ while those for the conditional estimator are indicated by o. Plots show results for the
EBLUP (top), MBDE (centre) and M-quantile (bottom) estimators


Table 7
Performance of EBLUP and MSE estimators when there are areas with zero sample

                                                 SIM1-A, median n_i = 10     AAGIS, median n_i = 9
                       Weighting Method/Estimator       RB(m)    RRMSE(m)       RB(m)    RRMSE(m)
Areas with n_i > 0     (13)/EBLUP                        0.00        0.52        2.29       24.94
Areas with n_i = 0     (23)/Synthetic EBLUP             -0.05 [illegible]       87.45       96.46
                       MSE Estimator                    RB(M)    RRMSE(M)       RB(M)    RRMSE(M)
Areas with n_i > 0     (13)/PR0                           0.5 [illegible] [illegible]         760
                       (13)/Conditional                   0.7          50       23.87         298
Areas with n_i = 0     (23)/PR0                          -1.8          35      -29.07         601
                       (23)/Conditional                  -3.6          34      -31.45         101

4. Conclusions and discussion


In this paper we propose a bias-robust and easily 
implemented method of estimating the conditional MSE of 
pseudo-linear estimators of small area means (and totals). 
Our empirical results show that this method of MSE 
estimation performs reasonably well in terms of bias when 
used to estimate the model-based MSE and the design-based 
MSE of the three rather different pseudo-linear estimators 
considered in this paper. However, this improved bias 
performance comes at the cost of increased variability. In 
particular, when area sample sizes are very small, we do not 
recommend use of our proposed method of MSE estimation 
for a conditionally biased estimator like the EBLUP. 

The EBLUP is widely used in SAE, and in this context
the model-dependent MSE estimator PR0 for the EBLUP
suggested by Prasad and Rao (1990) is unbiased when its
model assumptions are valid (SIM1-A/B and SIM2-A/B in
our model-based simulations) but is biased in the presence
of outlier area effects (SIM3-A/A* and SIM3-B/B*). It was
also the most stable MSE estimator in the model-based
simulations. However, its area-averaged construction meant
that it did not track the area-specific MSE of the EBLUP in
either of our design-based simulations, where the correctness of
the assumed linear mixed model could only be considered
as approximate. This suggests that our proposed conditional
MSE estimation method should be considered as an
alternative to PR0 in situations where there is some doubt
about the correctness of the specification of the small area
linear mixed model or where the area sample sizes are not
small. Some idea of what constitutes a small sample size can
be deduced from the empirical results presented in this
paper.

If there is doubt about the validity of the assumed linear 
mixed model, the user could consider estimation based on a 
more widely applicable alternative model, e.g., the M- 
quantile model, or replace the EBLUP by a more outlier- 
robust alternative (Sinha and Rao 2009). In the former case 
the approach that we propose in this paper is currently the 
only analytical approach to MSE estimation, while in the 
latter case it provides an analytic alternative to more 
computationally intensive bootstrap methods of MSE 
estimation. Note however, that for very small area-specific 
sample sizes the bias-robust MSE estimator proposed in this 
paper remains unstable. 

A future line of research could be to compare the analytic
MSE estimation method proposed in this paper with
bootstrap-based MSE estimators, e.g., the nonparametric
bootstrap MSE estimator of the M-quantile estimator
proposed by Tzavidis, Marchetti and Chambers (2010), and
the bootstrap MSE estimator for the Robust EBLUP
estimator proposed by Sinha and Rao (2009). A key issue in
this comparison will be whether alternative
bootstrap MSE estimators are more stable, especially for
small area-specific sample sizes.

The extension of the conditional MSE approach to non- 
linear SAE situations remains to be done. However, since 
this approach is closely linked to robust population level 
MSE estimation based on Taylor series linearisation (as well 
as jackknife estimation of MSE, see Valliant, Dorfman and 
Royall 2000, section 5.4.2), it should be possible to develop 
appropriate extensions for corresponding small area non- 
linear estimation methods. Although the relevant results are 
not provided here, some evidence for this is that the 
conditional MSE estimation method described in this paper 
has already been used to estimate the MSE of the MBDE 
when it is applied to variables that do not lend themselves to 
linear mixed modelling, e.g., those with a high proportion of 
zero values (Chandra and Chambers 2009), and categorical 
variables (Chandra, Chambers and Salvati 2011). More 
recently, the approach has also been used to estimate the 
MSE of geographically weighted M-quantile small area 
estimators in situations where the small area values are 
spatially correlated (Salvati, Tzavidis, Pratesi and Chambers 
2011). It has also been used by Salvati, Chandra, Ranalli 
and Chambers (2010) to estimate the MSE of small area 
estimators based on a nonparametric small area model 
(Opsomer, Claeskens, Ranalli, Kauermann and Breidt 
2008). 

As is clear from the development in this paper, our 
preferred approach to MSE estimation assumes that the 
MSE of real interest is that defined by the area-specific 
model (1). This is in contrast to the usual approach to 
defining MSE in SAE, which adopts an area-averaged MSE 
concept as the appropriate measure of the accuracy of a 
small area estimator. As pointed out by Longford (2007), 
the ultimate aim in SAE is to make inferences about small 
area characteristics conditional on the realised (but 
unknown) values of small area effects, i.e., with respect to 
(1). One can consider this to be a design-based objective (as 
in Longford 2007), or, as we prefer, a model-based ob- 
jective that does not quite fit into the usual random effects 
framework for SAE. In either case we are interested in 
variability that is with respect to fixed area-specific expected 
values. This is consistent with the concept of variability that 
is typically applied in population level inference.

Variance estimation under composite
imputation: The methodology behind SEVANI


Jean-François Beaumont and Joël Bissonnette ¹


Abstract 


Composite imputation is often used in business surveys. The term “composite” means that more than a single imputation
method is used to impute missing values for a variable of interest. The literature on variance estimation in the presence of
composite imputation is rather limited. To deal with this problem, we consider an extension of the methodology developed
by Särndal (1992). Our extension is quite general and easy to implement provided that linear imputation methods are used to
fill in the missing values. This class of imputation methods contains linear regression imputation, donor imputation and
auxiliary value imputation, sometimes called cold-deck or substitution imputation. It thus covers the most common methods
used by national statistical agencies for the imputation of missing values. Our methodology has been implemented in the
System for the Estimation of Variance due to Nonresponse and Imputation (SEVANI) developed at Statistics Canada. Its
performance is evaluated in a simulation study.


Key Words: Auxiliary value imputation; Composite imputation; Donor imputation; Imputation model; Linear imputation;
Regression imputation; SEVANI.


1. Introduction 


Composite imputation is often used in business surveys.
The term “composite” means that more than a single
imputation method is used to impute missing values for a
variable of interest. The choice of one method over another
depends on the availability of auxiliary variables. For
instance, ratio imputation could be used to impute a missing
value when an auxiliary value is available; otherwise, mean
imputation could be an alternative.
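
The rule just described (ratio imputation when an auxiliary value is available, mean imputation otherwise) can be sketched in a few lines. This is only an illustration under simplifying assumptions: the function name is hypothetical, design weights are ignored, and the ratio and the mean are computed from the respondents in the simplest possible way.

```python
def composite_impute(y, x):
    """Impute missing y-values (None) with two methods.

    Ratio imputation is used when the auxiliary value x is available;
    mean imputation is used otherwise. Unweighted, for illustration only.
    """
    respondents = [(yk, xk) for yk, xk in zip(y, x) if yk is not None]
    # Mean of all observed y-values, used for mean imputation.
    mean_y = sum(yk for yk, _ in respondents) / len(respondents)
    # Ratio of y to x among respondents that also have an auxiliary value.
    with_aux = [(yk, xk) for yk, xk in respondents if xk is not None]
    ratio = sum(yk for yk, _ in with_aux) / sum(xk for _, xk in with_aux)

    completed, method = [], []
    for yk, xk in zip(y, x):
        if yk is not None:
            completed.append(yk)
            method.append("observed")
        elif xk is not None:
            completed.append(ratio * xk)
            method.append("ratio")
        else:
            completed.append(mean_y)
            method.append("mean")
    return completed, method


# Toy data: two missing y-values, only one of which has an auxiliary value.
y_values = [10.0, 20.0, None, None, 30.0]
x_values = [5.0, 10.0, 8.0, None, None]
completed, method = composite_impute(y_values, x_values)
```

Both imputed values are linear combinations of the observed y-values, which is the property of linear imputation methods that the variance estimation approach in this paper relies on.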

The problem of estimating the variance in the presence of
a single imputation method has been extensively studied in
the literature; e.g., two excellent reviews of this topic are
Lee, Rancourt and Särndal (2001) and Haziza (2009).
Although the use of composite imputation occurs frequently
in practice, there is little literature on estimating its variance.
The literature includes a jackknife variance estimator that
was proposed and evaluated empirically in Rancourt, Lee
and Särndal (1993). Sitter and Rao (1997) developed further
the theory and obtained design-consistent linearization and
jackknife variance estimators. In both papers, two
imputation methods were considered, with ratio imputation being
one of the methods, simple random sampling was used and
uniform nonresponse was assumed. Later, Felx and
Rancourt (2001) extended the general methodology
proposed in Särndal (1992) and Deville and Särndal (1994) to
composite imputation using simplifying assumptions.
Finally, Shao and Steel (1999) developed an interesting and
general reverse approach to variance estimation to deal with
composite imputation (see also Kim and Rao 2009). Shao
and Steel (1999) claimed that their reverse approach leads to
derivations that are less involved than those found in Deville
and Särndal (1994). We do not fully agree with this
statement. Our results indicate that, in general, our extension to
Särndal’s approach actually leads to simpler derivations
than those obtained with the Shao and Steel approach. The
reverse approach may however become quite attractive
when the sampling fraction is negligible and a replication
variance estimation technique is chosen (see section 7 for
greater detail).

We consider the methodology proposed by Särndal
(1992) as a starting point. It requires the validity of an
imputation model; i.e., a model for the variable being
imputed. At first glance, the extension of this methodology
to composite imputation seems to be quite tedious, as noted
by Shao and Steel (1999), until we notice that most
imputation methods used in practice lead to imputed
estimators that are linear in the observed values of the variable
of interest. This considerably simplifies the derivation of a
variance estimator even when there is a single imputation
method. For the estimation of the sampling portion of the
overall variance, we use a methodology (see Beaumont and
Bocci 2009) that is slightly different from the one proposed
by Särndal (1992). This allows us to simplify the derivations
further. This research has been implemented in version 2 of
the System for the Estimation of Variance due to
Nonresponse and Imputation (SEVANI), which is developed at
Statistics Canada (see Beaumont, Bissonnette and Bocci
2010).

The paper is structured as follows. In section 2, some
notation is introduced and composite imputation is
explained. Linear imputation is defined in section 3. Our
approach to inference and our main assumptions are
described in section 4. In section 5, a number of results are
stated regarding variance estimation under composite
imputation. Section 6 presents the results of a simulation study
that assesses the performance of our variance estimator. The
reverse approach is briefly discussed in section 7 to
highlight the differences with our approach. Finally, a short
conclusion is given in section 8.


2. What is composite imputation? 


Suppose that we are interested in estimating the population
domain total $\theta = \sum_{k \in U} d_k y_k$, where U is the finite
population of size N, y is the variable of interest and d is a
domain indicator variable indicating whether population
unit k is in the domain of interest ($d_k = 1$) or not ($d_k = 0$).
A sample s of size n is selected from the finite population U
according to a probability sampling design p(s). In the
absence of missing values, $\theta$ can be estimated by the
Horvitz-Thompson estimator $\hat{\theta} = \sum_{k \in s} w_k d_k y_k$, where
$w_k = 1/\pi_k$ is the design weight and $\pi_k$ is the selection
probability of unit k. Although it is possible to extend our
results to calibration estimators, this extension is not
considered in this paper to keep matters simple.

Variable y can be missing for some of the sampled units
but we assume that the domain indicator variable d is
always observed for those units. The set of sampled units
with an observed y-value, called the set of respondents, is
denoted by $s_r$. It is assumed to have been generated
according to a nonresponse mechanism $q(s_r \mid s)$. The set of
nonrespondents is denoted by $s_m = s - s_r$. It is further split
into J mutually exclusive subsets, $s_m^{(j)}$, $j = 1, \ldots, J$, such
that $s_m = \cup_{j=1}^{J} s_m^{(j)}$, if composite imputation with $J > 1$
imputation methods is used. All the missing y-values within
a given subset $s_m^{(j)}$ are imputed with the same method j.
However, different imputation methods are used to impute
missing values in different subsets. The resulting imputed
estimator can be expressed as

$$
\hat{\theta}_I = \sum_{k \in s_r} w_k d_k y_k + \sum_{k \in s_m} w_k d_k y_k^{*}
= \sum_{k \in s_r} w_k d_k y_k + \sum_{j=1}^{J} \sum_{k \in s_m^{(j)}} w_k d_k y_k^{*},
\tag{2.1}
$$

where $y_k^{*}$ is the imputed y-value for unit k.

Composite imputation is quite frequent in business 
surveys. It is used because there are missing values in 
auxiliary variables used for imputation. To fix ideas, let $\mathbf{x}_k$
be the complete vector of auxiliary variables for unit k.
Ideally, all the missing y-values would be imputed using a
single imputation method based on the complete vector $\mathbf{x}_k$.
Unfortunately, there may be missing values in the auxiliary
variables so that, for some nonrespondents, we cannot use
$\mathbf{x}_k$ to impute their missing y-value; we can only use a
subset of $\mathbf{x}_k$. We denote as $\mathbf{x}_k^{obs}$ the vector of observed
auxiliary variables for unit k. This vector does not
necessarily contain the same observed variables from one
unit to the next. To impute the missing y-value of a given
unit k, an imputation method is chosen based on the
available auxiliary variables $\mathbf{x}_k^{obs}$. Since there may be a
number of nonresponse patterns in the complete vector of
auxiliary variables, the imputation strategy may contain a
number of imputation methods.


Example: 


The variance estimation issues raised by composite imputation
can be better understood by considering the following
example. Suppose that the complete vector of auxiliary
variables for unit k is $\mathbf{x}_k = (x_{1k}, x_{2k})$, where $x_{1k}$ is strongly
related to $y_k$ but subject to missing values while $x_{2k}$ is set
to a constant for all sampled units ($x_{2k} = 1$, $k \in s$). Ideally,
$x_{1k}$ is used to impute $y_k$ if it is missing. If $x_{1k}$ is not
available, only $x_{2k}$ can be used. Table 1 summarizes the
information available for the different subsets of the sample s.


Table 1
Available information when there is one auxiliary variable $x_1$
and a constant $x_2$

Set      Subset         y    $x_1$   $x_2$   $\mathbf{x}^{obs}$
$s_r$    $s_r^{(1)}$    O    O       O       $(x_1, x_2)$
$s_r$    $s_r^{(2)}$    O    M       O       $(M, x_2)$
$s_m$    $s_m^{(1)}$    M    O       O       $(x_1, x_2)$
$s_m$    $s_m^{(2)}$    M    M       O       $(M, x_2)$

O: Observed; M: Missing.


The set of nonrespondents $s_m$ is divided into the subsets
$s_m^{(1)}$ and $s_m^{(2)}$ depending on the availability of $x_1$. Similarly,
the set of respondents is divided into subsets $s_r^{(1)}$ and $s_r^{(2)}$.
In this example, we could use ratio imputation to impute
missing y-values in $s_m^{(1)}$ and mean imputation to impute
missing y-values in $s_m^{(2)}$. Note that simple linear regression
imputation could be used instead of ratio imputation (if it
better fits the data). We have chosen ratio imputation in this
example for its simplicity and because it is frequently used
in business surveys.

Only the respondents in $s_r^{(1)}$ can be used to impute
missing y-values in $s_m^{(1)}$ through ratio imputation. The
imputed value for a unit k in $s_m^{(1)}$ is

$$
y_k^{*} = x_{1k} \, \frac{\sum_{l \in s_r^{(1)}} \omega_l^{(1)} y_l}{\sum_{l \in s_r^{(1)}} \omega_l^{(1)} x_{1l}},
$$

where $\omega_l^{(1)}$ is some weight used for ratio
imputation (imputation method 1). Typical choices are:
$\omega_l^{(1)} = w_l$ (design-weighted imputation) or $\omega_l^{(1)} = 1$
(unweighted imputation). For mean imputation, the respondents
in $s_r^{(1)}$ as well as those in $s_r^{(2)}$ can be used to impute
missing y-values in $s_m^{(2)}$. In practice, it is common to use
both sets of respondents to improve the stability of the
imputed mean. The imputed value for a unit k in $s_m^{(2)}$ is

$$
y_k^{*} = \frac{\sum_{l \in s_r} \omega_l^{(2)} y_l}{\sum_{l \in s_r} \omega_l^{(2)}},
$$

where $\omega_l^{(2)}$ is a weight used for mean imputation
(imputation method 2). (Typical choices of $\omega_l^{(2)}$ are the same as
those for $\omega_l^{(1)}$; i.e., $\omega_l^{(2)} = w_l$ or $\omega_l^{(2)} = 1$.) This implies
that units in $s_r^{(1)}$ can be contributors to both imputation
methods. This raises issues for variance estimation of the
resulting composite imputation estimator. These issues will
be addressed in section 5.
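The two-method strategy of this example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `composite_impute`, the use of `np.nan` to flag a missing $x_1$, and the toy data are all hypothetical and are not taken from SEVANI.

```python
import numpy as np

def composite_impute(y, x1, w, resp, omega1=None, omega2=None):
    """Ratio imputation where x1 is observed, mean imputation elsewhere.

    y    : y-values (entries of nonrespondents are ignored)
    x1   : auxiliary variable, np.nan where missing (assumed convention)
    w    : design weights
    resp : boolean array, True for respondents (the set s_r)
    """
    y, x1, w = (np.asarray(a, dtype=float) for a in (y, x1, w))
    resp = np.asarray(resp, dtype=bool)
    # Default to design-weighted imputation (omega = w) for both methods.
    omega1 = w if omega1 is None else np.asarray(omega1, dtype=float)
    omega2 = w if omega2 is None else np.asarray(omega2, dtype=float)
    has_x = ~np.isnan(x1)
    r1 = resp & has_x       # s_r^(1): contributors to ratio imputation
    m1 = ~resp & has_x      # s_m^(1): imputed by the ratio method
    m2 = ~resp & ~has_x     # s_m^(2): imputed by the mean method
    ratio = np.sum(omega1[r1] * y[r1]) / np.sum(omega1[r1] * x1[r1])
    mean = np.sum(omega2[resp] * y[resp]) / np.sum(omega2[resp])
    y_imp = np.where(resp, y, np.nan)
    y_imp[m1] = ratio * x1[m1]
    y_imp[m2] = mean
    return y_imp

# Toy sample: units 3 and 4 are nonrespondents; x1 is missing for units 2 and 4.
y = np.array([10.0, 20.0, 30.0, 0.0, 0.0])
x1 = np.array([5.0, 10.0, np.nan, 8.0, np.nan])
w = np.full(5, 2.0)
d = np.ones(5)
resp = np.array([True, True, True, False, False])

y_imp = composite_impute(y, x1, w, resp)
theta_I = np.sum(w * d * y_imp)   # imputed estimator (2.1)
```

Units 0 and 1 contribute to both the ratio and the mean, which is the overlap that complicates variance estimation.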


3. What is linear imputation? 


The imputation method j is said to be linear if the
imputed value $y_k^{*}$ for a sample unit $k \in s_m^{(j)}$ can be written
in the linear form

$$
y_k^{*} = \phi_{0k}^{(j)} + \sum_{l \in s_r} \phi_{lk}^{(j)} y_l.
\tag{3.1}
$$

The quantities $\phi_{0k}^{(j)}$ and $\phi_{lk}^{(j)}$, for $l \in s_r$, are obtained
without using y-values, but may depend on s and $s_r$. The
linear form (3.1) is satisfied by several of the most common
imputation methods in practice such as (weighted or
unweighted) linear regression imputation, donor imputation
and auxiliary value imputation. A nice review of these
methods is found in Haziza (2009). Note that auxiliary value
imputation does not use the y-values of respondents; i.e.,
$\phi_{lk}^{(j)} = 0$ (see Beaumont, Haziza and Bocci 2011). For
donor imputation, the imputed value $y_k^{*}$ is equal to the
y-value of a suitably chosen respondent (donor) so that
$\phi_{0k}^{(j)} = 0$ and $\phi_{lk}^{(j)} = 0$ for all but one respondent $l \in s_r$.
Detailed expressions for $\phi_{0k}^{(j)}$ and $\phi_{lk}^{(j)}$ are given in the
Methodology Guide of SEVANI (Beaumont, Bissonnette
and Bocci 2010), which is available on request from the
authors.

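Let $\hat{\theta}_I^{(j)} = \sum_{k \in s_m^{(j)}} w_k d_k y_k^{*}$ be the contribution of
imputation method j to the estimator $\hat{\theta}_I$. Using (3.1), $\hat{\theta}_I^{(j)}$ can
be decomposed as follows:

$$
\begin{aligned}
\hat{\theta}_I^{(j)} &= \sum_{k \in s_m^{(j)}} w_k d_k y_k^{*} \\
&= \sum_{k \in s_m^{(j)}} w_k d_k \phi_{0k}^{(j)}
 + \sum_{l \in s_r} \Big( \sum_{k \in s_m^{(j)}} w_k d_k \phi_{lk}^{(j)} \Big) y_l \\
&= W_{\phi 0}^{(j)} + \sum_{l \in s_r} W_{\phi l}^{(j)} y_l,
\end{aligned}
\tag{3.2}
$$

where $W_{\phi 0}^{(j)} = \sum_{k \in s_m^{(j)}} w_k d_k \phi_{0k}^{(j)}$ and
$W_{\phi l}^{(j)} = \sum_{k \in s_m^{(j)}} w_k d_k \phi_{lk}^{(j)}$.
Using (3.2), the imputed estimator (2.1) can be expressed in
the linear form:

$$
\begin{aligned}
\hat{\theta}_I &= \sum_{k \in s_r} w_k d_k y_k + \sum_{j=1}^{J} \hat{\theta}_I^{(j)} \\
&= W_{\phi 0}^{(+)} + \sum_{k \in s_r} \big( w_k d_k + W_{\phi k}^{(+)} \big) y_k,
\end{aligned}
\tag{3.3}
$$

where $W_{\phi 0}^{(+)} = \sum_{j=1}^{J} W_{\phi 0}^{(j)}$ and $W_{\phi k}^{(+)} = \sum_{j=1}^{J} W_{\phi k}^{(j)}$.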


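Continuing with the example introduced at the end of
section 2, we observe that for ratio imputation, $\phi_{0k}^{(1)} = 0$
and $\phi_{lk}^{(1)} = \omega_l^{(1)} x_{1k} / \sum_{l' \in s_r^{(1)}} \omega_{l'}^{(1)} x_{1l'}$, for $l \in s_r^{(1)}$, with $\phi_{lk}^{(1)} = 0$
for $l \in s_r^{(2)}$. For mean imputation, we have $\phi_{0k}^{(2)} = 0$ and
$\phi_{lk}^{(2)} = \omega_l^{(2)} / \sum_{l' \in s_r} \omega_{l'}^{(2)}$, for $l \in s_r$. Consequently,
$W_{\phi 0}^{(1)} = 0$, $W_{\phi 0}^{(2)} = 0$,

$$
W_{\phi l}^{(1)} = \omega_l^{(1)} \, \frac{\sum_{k \in s_m^{(1)}} w_k d_k x_{1k}}{\sum_{l' \in s_r^{(1)}} \omega_{l'}^{(1)} x_{1l'}}
\quad \text{for } l \in s_r^{(1)}
\quad (\text{and } W_{\phi l}^{(1)} = 0 \text{ for } l \in s_r^{(2)}),
$$

and

$$
W_{\phi l}^{(2)} = \omega_l^{(2)} \, \frac{\sum_{k \in s_m^{(2)}} w_k d_k}{\sum_{l' \in s_r} \omega_{l'}^{(2)}}
\quad \text{for } l \in s_r.
$$

This implies that $W_{\phi 0}^{(+)} = 0$ and $W_{\phi l}^{(+)} = W_{\phi l}^{(1)} + W_{\phi l}^{(2)}$.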
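As a numerical check of the linear form (3.3) for this example, the following Python sketch computes $W_{\phi l}^{(+)}$ and compares the linear expression with the directly imputed estimator. The function name `linear_weights`, the `np.nan` convention for a missing $x_1$ and the toy data are assumptions made for illustration only.

```python
import numpy as np

def linear_weights(x1, w, d, resp, omega1, omega2):
    """Return W_phi_l^(+) for each sampled unit (zero for nonrespondents)."""
    has_x = ~np.isnan(x1)
    r1, m1, m2 = resp & has_x, ~resp & has_x, ~resp & ~has_x
    W1 = np.zeros_like(w)
    W2 = np.zeros_like(w)
    # Ratio imputation: phi_lk = omega1_l * x1_k / sum_{s_r^(1)} omega1 * x1
    W1[r1] = omega1[r1] * np.sum(w[m1] * d[m1] * x1[m1]) / np.sum(omega1[r1] * x1[r1])
    # Mean imputation: phi_lk = omega2_l / sum_{s_r} omega2
    W2[resp] = omega2[resp] * np.sum(w[m2] * d[m2]) / np.sum(omega2[resp])
    return W1 + W2

# Toy sample: units 3 and 4 are nonrespondents; x1 is missing for units 2 and 4.
y = np.array([10.0, 20.0, 30.0, 0.0, 0.0])
x1 = np.array([5.0, 10.0, np.nan, 8.0, np.nan])
w = np.full(5, 2.0)
d = np.ones(5)
resp = np.array([True, True, True, False, False])

W_plus = linear_weights(x1, w, d, resp, omega1=w, omega2=w)
# Linear form (3.3) with W_phi0^(+) = 0: only observed y-values are used.
theta_linear = np.sum(((w * d + W_plus) * y)[resp])

# Direct computation: impute, then apply (2.1).
ratio = (w[:2] * y[:2]).sum() / (w[:2] * x1[:2]).sum()
mean = (w[:3] * y[:3]).sum() / w[:3].sum()
y_imp = np.array([10.0, 20.0, 30.0, ratio * x1[3], mean])
theta_direct = np.sum(w * d * y_imp)
```

Both routes give the same total, which is what makes the variance formulas of section 5 depend on the imputation strategy only through the weights.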


4. Approach to inference and main assumptions 


We consider three sources of variability when evaluating 
expectations and variances of the imputed estimator: the 
variability due to the imputation model, the sampling design 
and the nonresponse mechanism. Note that the use of an 
imputation model to make inference in the presence of 
imputation can be found in Rubin (1987), Hidiroglou (1989) 
and Sarndal (1992). In what follows, we will use the 
subscripts m, p and q to denote the expectations, variances 
and covariances evaluated with respect to the imputation 
model, sampling design and nonresponse mechanism 
respectively. 

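We consider the following imputation model to describe
the relationship between the y-variable and the vector $\mathbf{x}^{obs}$
of observed auxiliary variables:

$$
\begin{aligned}
E_m(y_k \mid \mathbf{X}^{obs}) &= \mu_k, \\
\mathrm{var}_m(y_k \mid \mathbf{X}^{obs}) &= \sigma_k^2, \\
\mathrm{cov}_m(y_k, y_l \mid \mathbf{X}^{obs}) &= 0,
\end{aligned}
\tag{4.1}
$$

for $k \neq l$ and $k, l \in U$. The population matrix $\mathbf{X}^{obs}$
contains the vectors of observed auxiliary variables, $\mathbf{x}_k^{obs}$,
for $k \in U$, and $\mu_k$ and $\sigma_k^2$ are functions of $\mathbf{x}_k^{obs}$.
Asymptotically m-unbiased and m-consistent estimators of
$\mu_k$ and $\sigma_k^2$ are denoted by $\hat{\mu}_k$ and $\hat{\sigma}_k^2$ respectively. Since
we will always condition on $\mathbf{X}^{obs}$, we exclude this
conditioning from the notation to simplify it. For instance,
$E_m(y_k \mid \mathbf{X}^{obs})$ will be written as $E_m(y_k)$.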

In model (4.1), we condition on the observed auxiliary 
variables. Since the nonresponse pattern in the vector x is 
not the same for all the nonrespondents, a separate condi- 
tional model must be validated and fitted for each non- 
response pattern. In principle, these conditional models 
should be used to determine the imputation methods chosen. 
Note that model (4.1) reduces to the standard conditional
model (e.g., Särndal 1992) when the vector x of auxiliary
variables is not subject to missing values.


Remark: The validity of the variance estimation method in
section 5 requires $\mu_k$ and $\sigma_k^2$ to be correctly specified.
Although a parametric form for $\mu_k$ may often be acceptable, it
may be more difficult to determine a suitable parametric
form for $\sigma_k^2$. To avoid this issue and obtain some robustness
against misspecification of the model variance, $\sigma_k^2$ can be
estimated nonparametrically; see the empirical study of
Beaumont, Haziza and Bocci (2011) for an illustration of
this property under auxiliary value imputation. In the
context of donor imputation, Beaumont and Bocci (2009)
showed empirically that nonparametric estimation of both
$\mu_k$ and $\sigma_k^2$, via penalized smoothing splines, significantly
reduced the vulnerability of our variance estimator to
misspecifications of the model mean and variance.

In addition to the imputation model (4.1), we also assume
that:

$$
F(\mathbf{Y} \mid s, s_r, \mathbf{X}^{obs}, \mathbf{D}, \mathbf{Z}) = F(\mathbf{Y} \mid \mathbf{X}^{obs}),
\tag{4.2}
$$

where $F(\cdot)$ denotes the distribution function, $\mathbf{Y}$ and $\mathbf{D}$ are
N-element vectors containing respectively $y_k$ and $d_k$ as
their k-th element, and $\mathbf{Z}$ is an N-row matrix of design
information, which implicitly or explicitly contains information
about the selection probabilities $\pi_k$ and joint selection
probabilities $\pi_{kl}$, for $k, l \in U$. This assumption, often
implicit in other papers, allows us to treat the response
indicators, the domain indicators and the design information
as fixed when taking model expectations and variances. A
careful choice of the auxiliary variables is necessary to
satisfy this assumption. For instance, the design information
and the domain indicators should be considered as potential
auxiliary variables.

The imputation strategy given in our example started in
section 2 could be justified by a model with $\mu_k = \beta_1 x_{1k}$ and
$\sigma_k^2 = \sigma_1^2 x_{1k}$, for $k \in s_r^{(1)}$ or $k \in s_m^{(1)}$, and $\mu_k = \beta_2$ and
$\sigma_k^2 = \sigma_2^2$, for $k \in s_r^{(2)}$ or $k \in s_m^{(2)}$. The model parameters
$\beta_1$, $\beta_2$, $\sigma_1^2$ and $\sigma_2^2$ are unknown. Note that if the $x_{1k}$'s are
assumed to be identically distributed random variables with
mean $\mu_x$ and variance $\sigma_x^2$, then $\beta_2 = \beta_1 \mu_x$ and
$\sigma_2^2 = \beta_1^2 \sigma_x^2 + \sigma_1^2 \mu_x$. The imputed values $y_k^{*} = \hat{\mu}_k$, for $k \in s_m$,
are obtained by estimating the model parameters $\beta_1$ and $\beta_2$
from the observed data. For instance, the m-unbiased
estimators of $\beta_1$ and $\beta_2$ could be chosen as

$$
\hat{\beta}_1 = \frac{\sum_{k \in s_r^{(1)}} \omega_k^{(1)} y_k}{\sum_{k \in s_r^{(1)}} \omega_k^{(1)} x_{1k}}
$$

and

$$
\hat{\beta}_2 = \frac{\sum_{k \in s_r^{(2)}} \omega_k^{(2)} y_k}{\sum_{k \in s_r^{(2)}} \omega_k^{(2)}},
$$

respectively. This would lead to $\hat{\mu}_k = \hat{\beta}_1 x_{1k}$ for $k \in s_r^{(1)}$
or $k \in s_m^{(1)}$, and $\hat{\mu}_k = \hat{\beta}_2$ for $k \in s_r^{(2)}$ or $k \in s_m^{(2)}$. As in
section 2, one could also consider the potentially more
efficient estimator $\hat{\beta}_2^{*} = \sum_{k \in s_r} \omega_k^{(2)} y_k / \sum_{k \in s_r} \omega_k^{(2)}$ instead of
$\hat{\beta}_2$. Unfortunately, $\hat{\beta}_2^{*}$ is biased under the model since

$$
E_m(\hat{\beta}_2^{*} \mid s, s_r) = \beta_2
+ \frac{\sum_{k \in s_r^{(1)}} \omega_k^{(2)} (\beta_1 x_{1k} - \beta_2)}{\sum_{k \in s_r} \omega_k^{(2)}}.
\tag{4.3}
$$

As pointed out above, if the $x_{1k}$'s are assumed to be
identically distributed random variables with mean $\mu_x$ and
variance $\sigma_x^2$, $\beta_2 = \beta_1 \mu_x$, and equation (4.3) can be
rewritten as

$$
E_m(\hat{\beta}_2^{*} \mid s, s_r) = \beta_2
+ \beta_1 \frac{\sum_{k \in s_r^{(1)}} \omega_k^{(2)} (x_{1k} - \mu_x)}{\sum_{k \in s_r} \omega_k^{(2)}}.
\tag{4.4}
$$

It can be shown under weak conditions that
$E_m(\hat{\beta}_2^{*} \mid s, s_r) = \beta_2 + O_p(1/\sqrt{n})$ so that the model bias of $\hat{\beta}_2^{*}$ is
asymptotically negligible. However, since $\mathrm{var}_m(\hat{\beta}_2^{*} \mid s, s_r) = O_p(1/n)$,
the squared model bias is not necessarily
asymptotically negligible compared to the model variance
of $\hat{\beta}_2^{*}$. At least, $\hat{\beta}_2^{*}$ is m-consistent for $\beta_2$. From (4.3) or
(4.4), we can see that the model bias of $\hat{\beta}_2^{*}$ can be
controlled by assigning a smaller weight $\omega_k^{(2)}$ to units
$k \in s_r^{(1)}$ relative to units $k \in s_r^{(2)}$. For instance, one could
take $\omega_k^{(2)}$ to be a down-weighted version of $w_k$, controlled by
some $\alpha > 0$, for $k \in s_r^{(1)}$, and $\omega_k^{(2)} = w_k$ for $k \in s_r^{(2)}$. In the
extreme case where $\omega_k^{(2)} = 0$, for $k \in s_r^{(1)}$, $\hat{\beta}_2^{*}$ is
model-unbiased because it is equal to $\hat{\beta}_2$. Note that the model
bias of $\hat{\beta}_2^{*}$ could be larger than $O_p(1/\sqrt{n})$ if $x_{1k}$, $k \in s_r^{(1)}$,
have a mean different from $x_{1k}$, $k \in s_r^{(2)}$. In such a case,
controlling the model bias of $\hat{\beta}_2^{*}$ might be more important.

In the case of donor imputation, a fourth source of
variability needs to be considered when donors are randomly
selected among respondents to impute nonrespondents.
In this paper, the subscript q will implicitly
indicate that moments are evaluated with respect to the joint
distribution induced by the nonresponse mechanism and the
random donor selection mechanism. As a result, when
conditioning on $s_r$, as in (4.2), it should be kept in mind
that conditioning is not only on the set of respondents but
also on the set of selected donors.
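The effect of down-weighting the units of $s_r^{(1)}$ on the model bias in (4.3) can be illustrated numerically. The sketch below simply evaluates the second term of (4.3); the function name `beta2_star_bias`, the parameter values and the weights are hypothetical choices made for illustration.

```python
import numpy as np

def beta2_star_bias(x1_r1, omega_r1, omega_r2, beta1, beta2):
    """Second term of (4.3): model bias of the pooled mean estimator.

    x1_r1    : x1-values of respondents in s_r^(1)
    omega_r1 : mean-imputation weights of respondents in s_r^(1)
    omega_r2 : mean-imputation weights of respondents in s_r^(2)
    """
    num = np.sum(omega_r1 * (beta1 * x1_r1 - beta2))
    return num / (np.sum(omega_r1) + np.sum(omega_r2))

# Assumed values: beta2 = beta1 * mu_x with beta1 = 1.5 and mu_x = 48.
x1_r1 = np.array([40.0, 50.0, 60.0, 70.0])
beta1, beta2 = 1.5, 72.0

bias_full = beta2_star_bias(x1_r1, np.ones(4), np.ones(6), beta1, beta2)
bias_damped = beta2_star_bias(x1_r1, np.full(4, 0.1), np.ones(6), beta1, beta2)
bias_zero = beta2_star_bias(x1_r1, np.zeros(4), np.ones(6), beta1, beta2)
```

With full weights the bias is 4.2 in this toy setting; it shrinks as the weights of the units in $s_r^{(1)}$ are reduced and vanishes when they are set to zero, at the cost of using fewer contributors.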


5. Variance estimation 


Särndal (1992) expresses the total error of the imputed
estimator as:

$$
\hat{\theta}_I - \theta = (\hat{\theta} - \theta) + (\hat{\theta}_I - \hat{\theta}),
\tag{5.1}
$$

where the first term on the right-hand side of (5.1) is called
the sampling error and the second term is called the
nonresponse error. Using the assumptions given in section 4
and $E_p(\hat{\theta} - \theta) = 0$, the overall bias of the imputed
estimator reduces to $E_{mpq}(\hat{\theta}_I - \theta) = E_{pq}(B_m)$, where
$B_m = E_m(\hat{\theta}_I - \hat{\theta} \mid s, s_r)$ is the (conditional) model bias of the
imputed estimator. Using (2.1), the model bias can be
expressed as

$$
B_m = \sum_{j=1}^{J} \sum_{k \in s_m^{(j)}} w_k d_k \, E_m(y_k^{*} - y_k \mid s, s_r).
\tag{5.2}
$$

This means that the model bias and the overall bias
vanish if the model expectation of the imputation error,
$y_k^{*} - y_k$, is zero, for $k \in s_m^{(j)}$ and $j = 1, \ldots, J$. In principle,
an imputation strategy should be chosen so that this
condition is satisfied (at least approximately). This is
typically assumed in the literature (e.g., Särndal 1992; Shao
and Steel 1999).
In the example introduced in section 2, the model bias
(5.2) reduces to

$$
B_m = \Big( \sum_{k \in s_m^{(2)}} w_k d_k \Big) E_m(\hat{\beta}_2^{*} - \beta_2 \mid s, s_r).
$$

An expression for $E_m(\hat{\beta}_2^{*} - \beta_2 \mid s, s_r)$ is given by (4.3)
or (4.4). As noted in the paragraph that follows equation
(4.4), the model bias, $B_m$, can be controlled by assigning a
smaller weight $\omega_k^{(2)}$ to units $k \in s_r^{(1)}$ relative to units
$k \in s_r^{(2)}$. It is also small if the number of nonrespondents
imputed by method 2 is small. Note that our variance (or
Mean Squared Error, MSE) estimation approach requires
the slightly weaker assumption that $E_q(B_m \mid s)$ is negligible
(see section 5.3).

Using (5.1), Särndal (1992) decomposed the overall
MSE into three components:

$$
E_{mpq}(\hat{\theta}_I - \theta)^2 = E_m \mathrm{var}_p(\hat{\theta})
+ E_{pq} E_m\{(\hat{\theta}_I - \hat{\theta})^2 \mid s, s_r\}
+ 2 E_{pq} E_m\{(\hat{\theta}_I - \hat{\theta})(\hat{\theta} - \theta) \mid s, s_r\}.
\tag{5.3}
$$

The overall MSE (5.3) becomes approximately equivalent
to the overall variance, $\mathrm{var}_{mpq}(\hat{\theta}_I - \theta)$, when the overall
bias is negligible. The first, second and third terms on the
right-hand side of (5.3) are referred to as the sampling
variance, the nonresponse variance and the mixed component
respectively. The sum of the last two terms can be
called the nonresponse component since these terms would
disappear if there were no nonresponse. The nonresponse
component is simply the difference between the overall
MSE/variance and the sampling variance. In what follows,
we develop an estimator for each of these three terms.
5.1 Estimation of the sampling variance


Let $v(y)$ be a p-unbiased estimator of $\mathrm{var}_p(\hat{\theta})$ that
would be used under complete response. The typical
Horvitz-Thompson estimator is

$$
v(y) = \sum_{k \in s} \sum_{l \in s} \frac{\pi_{kl} - \pi_k \pi_l}{\pi_{kl}} \, (w_k d_k y_k)(w_l d_l y_l),
\tag{5.4}
$$

where $\pi_{kl}$ is the joint selection probability of units k and l.
In the presence of nonresponse, $\hat{V}_{ORD} = v(y_{\bullet})$ is the naive
sampling variance estimator that treats the imputed values
as true values, where $y_{\bullet}$ is the imputed y-variable; i.e.,
$y_{\bullet k} = y_k$ for $k \in s_r$ and $y_{\bullet k} = y_k^{*}$ for $k \in s_m$.

Särndal (1992) proposed the following mpq-unbiased
estimator of the sampling variance $V_{SAM} = E_m \mathrm{var}_p(\hat{\theta})$:
$\hat{V}_{SAM} = \hat{V}_{ORD} + \hat{V}_{DIF}$,
where $\hat{V}_{DIF}$ is an m-unbiased estimator of
$V_{DIF} = E_m(v(y) - \hat{V}_{ORD} \mid s, s_r)$. Unfortunately, the expression
for $V_{DIF}$ is usually tedious to derive, and it is even more so
when composite imputation is used.

Beaumont and Bocci (2009) simplified Särndal's
derivations by conditioning on $\mathbf{Y}_r$, the vector containing the
observed y-values. More explicitly, let
$V_{DIF}^{C} = E_m(v(y) - \hat{V}_{ORD} \mid s, s_r, \mathbf{Y}_r)$ and $\hat{V}_{DIF}^{C}$ be an
m-unbiased estimator of $V_{DIF}^{C}$; i.e.,
$E_m(\hat{V}_{DIF}^{C} - V_{DIF}^{C} \mid s, s_r, \mathbf{Y}_r) = 0$. Our mpq-unbiased
sampling variance estimator is $\hat{V}_{SAM}^{C} = \hat{V}_{ORD} + \hat{V}_{DIF}^{C}$. Since
$\hat{V}_{ORD}$ is a constant when conditioning on s, $s_r$ and $\mathbf{Y}_r$,
$\hat{V}_{SAM}^{C}$ can simply be obtained by estimating
$E_m(v(y) \mid s, s_r, \mathbf{Y}_r)$. If (5.4) is used,

$$
E_m\{v(y) \mid s, s_r, \mathbf{Y}_r\} = v(\tilde{y}) + \sum_{k \in s_m} (1 - \pi_k) w_k^2 d_k \sigma_k^2,
\tag{5.5}
$$

where $\tilde{y}_k = y_k$, for $k \in s_r$, and $\tilde{y}_k = \mu_k$, for $k \in s_m$.

An estimator $\hat{V}_{SAM}^{C}$ of (5.5) is obtained by replacing the
unknown mean $\mu_k$ and unknown variance $\sigma_k^2$ in (5.5) by
m-unbiased (or at least m-consistent) estimators $\hat{\mu}_k$ and $\hat{\sigma}_k^2$.
This estimator is easy to compute provided a software
package that treats the complete response case is available
to obtain the first term on the right-hand side of (5.5). The
general formula (5.5) can be used for every imputation
strategy. The only difference between different imputation
strategies lies in the choice of the imputation model and the
estimators $\hat{\mu}_k$ and $\hat{\sigma}_k^2$.
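A minimal Python sketch of the estimator based on (5.5) is given below, assuming simple random sampling without replacement so that the selection probabilities have a closed form. The helper names (`ht_variance`, `sampling_variance`, `srswor_probs`) and the plugged-in values of $\hat{\mu}_k$ and $\hat{\sigma}_k^2$ are hypothetical; in practice they would come from the fitted imputation model.

```python
import numpy as np

def ht_variance(z, pi, pi_joint):
    """Horvitz-Thompson variance estimator (5.4) applied to z_k = d_k * y_k."""
    e = z / pi
    delta = (pi_joint - np.outer(pi, pi)) / pi_joint
    return float(e @ delta @ e)

def sampling_variance(y, d, resp, mu_hat, sigma2_hat, pi, pi_joint):
    """Plug-in version of (5.5): v(y_tilde) + sum_{s_m} (1 - pi_k) w_k^2 d_k sigma_k^2."""
    y_tilde = np.where(resp, y, mu_hat)   # observed y for s_r, mu_hat for s_m
    w = 1.0 / pi
    extra = np.sum(((1.0 - pi) * w**2 * d * sigma2_hat)[~resp])
    return ht_variance(d * y_tilde, pi, pi_joint) + extra

def srswor_probs(n, N):
    """First- and second-order selection probabilities under SRSWOR."""
    pi = np.full(n, n / N)
    pi_joint = np.full((n, n), n * (n - 1) / (N * (N - 1)))
    np.fill_diagonal(pi_joint, n / N)
    return pi, pi_joint

n, N = 5, 20
pi, pi_joint = srswor_probs(n, N)
y = np.array([10.0, 20.0, 30.0, 0.0, 0.0])
resp = np.array([True, True, True, False, False])
d = np.ones(n)
mu_hat = np.array([10.0, 20.0, 20.0, 16.0, 20.0])       # assumed model means
sigma2_hat = np.array([8.0, 16.0, 25.0, 12.8, 25.0])    # assumed model variances

v_sam = sampling_variance(y, d, resp, mu_hat, sigma2_hat, pi, pi_joint)
```

The first term is what complete-response software would return when fed the vector with $\hat{\mu}_k$ in place of the missing values; only the second term is specific to imputation.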


5.2 Estimation of the nonresponse variance 
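An mpq-unbiased estimator of the nonresponse variance
$V_{NR} = E_{pq} E_m\{(\hat{\theta}_I - \hat{\theta})^2 \mid s, s_r\}$ is obtained by finding an
m-unbiased estimator of

$$
E_m\{(\hat{\theta}_I - \hat{\theta})^2 \mid s, s_r\} = \mathrm{var}_m\{(\hat{\theta}_I - \hat{\theta}) \mid s, s_r\} + B_m^2.
\tag{5.6}
$$

Using $\hat{\theta}_I$ defined in the first equation of (3.3), the
nonresponse error with composite imputation can be
decomposed into J components:

$$
\hat{\theta}_I - \hat{\theta} = \sum_{j=1}^{J} \big( \hat{\theta}_I^{(j)} - \hat{\theta}^{(j)} \big),
$$

where $\hat{\theta}^{(j)} = \sum_{k \in s_m^{(j)}} w_k d_k y_k$. Each of these J components,
$\hat{\theta}_I^{(j)} - \hat{\theta}^{(j)}$, is associated with a different imputation
method. Since $y_k^{*}$ only involves observed y-values,
$\hat{\theta}_I^{(j)} = \sum_{k \in s_m^{(j)}} w_k d_k y_k^{*}$ only involves observed y-values as well and
thus $\hat{\theta}_I^{(i)}$ and $\hat{\theta}^{(j)}$ are independent under the model.
Therefore, the model variance of the nonresponse error can
be written as

$$
\mathrm{var}_m\{(\hat{\theta}_I - \hat{\theta}) \mid s, s_r\}
= \sum_{i=1}^{J} \sum_{j=1}^{J} \mathrm{cov}_m(\hat{\theta}_I^{(i)}, \hat{\theta}_I^{(j)} \mid s, s_r)
+ \sum_{j=1}^{J} \mathrm{var}_m(\hat{\theta}^{(j)} \mid s, s_r).
\tag{5.7}
$$

Note that the covariances $\mathrm{cov}_m(\hat{\theta}_I^{(i)}, \hat{\theta}_I^{(j)} \mid s, s_r)$, for
$i \neq j$, are not necessarily negligible because some observed
y-values can be used for more than one imputation
method.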

The derivations of the model variance (5.7) could be
quite involved when several imputation methods are used
because of the non-negligible covariances. The algebra can
be greatly simplified for linear imputation methods. By
using the second equation given in (3.3), the nonresponse
error can be expressed as

$$
\hat{\theta}_I - \hat{\theta} = W_{\phi 0}^{(+)} + \sum_{k \in s_r} W_{\phi k}^{(+)} y_k - \sum_{k \in s_m} w_k d_k y_k.
\tag{5.8}
$$

Since the nonresponse error is linear in the y-values, its
model variance is given by

$$
\mathrm{var}_m\{(\hat{\theta}_I - \hat{\theta}) \mid s, s_r\}
= \sum_{k \in s_r} \big( W_{\phi k}^{(+)} \big)^2 \sigma_k^2 + \sum_{k \in s_m} w_k^2 d_k \sigma_k^2.
\tag{5.9}
$$

If the model bias $B_m$ is negligible, an mpq-unbiased
estimator $\hat{V}_{NR}$ of the nonresponse variance $V_{NR}$ is obtained
by replacing $\sigma_k^2$ in (5.9) by an m-unbiased (and
m-consistent) estimator $\hat{\sigma}_k^2$. If the model bias is not negligible,
it can be estimated by an m-consistent estimator $\hat{B}_m$ and,
using equation (5.6), the nonresponse variance estimator
$\hat{V}_{NR}$ can be replaced by $\hat{V}_{NR} + \hat{B}_m^2$. Note that $\hat{B}_m^2$ is
m-consistent for $B_m^2$ provided that $\hat{B}_m$ is m-consistent for
$B_m$. The estimator $\hat{B}_m$ can be found by using (5.8) and
writing the model bias as

$$
\begin{aligned}
B_m &= E_m(\hat{\theta}_I - \hat{\theta} \mid s, s_r) \\
&= W_{\phi 0}^{(+)} + \sum_{k \in s_r} W_{\phi k}^{(+)} \mu_k - \sum_{k \in s_m} w_k d_k \mu_k.
\end{aligned}
\tag{5.10}
$$

The estimator $\hat{B}_m$ is obtained by replacing $\mu_k$ in (5.10)
by an m-consistent estimator $\hat{\mu}_k$.
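Once the weights $W_{\phi k}^{(+)}$ are available, (5.9) and (5.10) reduce to a few weighted sums. The following Python sketch shows this under the assumption that the weights, $\hat{\mu}_k$ and $\hat{\sigma}_k^2$ have already been computed; the function names and the toy inputs (mean imputation of two nonrespondents from two respondents) are hypothetical.

```python
import numpy as np

def nonresponse_variance(W_plus, w, d, resp, sigma2_hat):
    """Plug-in version of (5.9).

    First sum runs over respondents, second sum over nonrespondents.
    """
    term_r = np.sum((W_plus**2 * sigma2_hat)[resp])
    term_m = np.sum((w**2 * d * sigma2_hat)[~resp])
    return term_r + term_m

def model_bias(W0_plus, W_plus, w, d, resp, mu_hat):
    """Plug-in version of (5.10)."""
    return (W0_plus
            + np.sum((W_plus * mu_hat)[resp])
            - np.sum((w * d * mu_hat)[~resp]))

# Toy case: two respondents, two nonrespondents, equal design weights.
w = np.full(4, 2.0)
d = np.ones(4)
resp = np.array([True, True, False, False])
# Mean imputation with omega = w gives W_k = w_k * sum_{s_m} w d / sum_{s_r} w = 2.
W_plus = np.array([2.0, 2.0, 0.0, 0.0])
sigma2_hat = np.full(4, 9.0)   # assumed model variance estimates
mu_hat = np.full(4, 5.0)       # assumed model mean estimates

v_nr = nonresponse_variance(W_plus, w, d, resp, sigma2_hat)
b_hat = model_bias(0.0, W_plus, w, d, resp, mu_hat)
```

In this toy case the estimated bias is zero because the same mean model is assumed for all units; a non-zero value would be added to the variance as $\hat{B}_m^2$.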


5.3 Estimation of the mixed component 


An mpq-unbiased estimator of the mixed component

$$
V_{MIX} = 2 E_{pq} E_m\{(\hat{\theta}_I - \hat{\theta})(\hat{\theta} - \theta) \mid s, s_r\}
$$

is obtained by finding an m-unbiased estimator of

$$
2 E_m\{(\hat{\theta}_I - \hat{\theta})(\hat{\theta} - \theta) \mid s, s_r\}
= 2\,\mathrm{cov}_m\{(\hat{\theta}_I - \hat{\theta}), (\hat{\theta} - \theta) \mid s, s_r\}
+ 2 B_m E_m\{(\hat{\theta} - \theta) \mid s, s_r\}.
\tag{5.11}
$$

Since both the nonresponse error and the sampling error
are linear in the y-values, using (5.8) we obtain:

$$
2\,\mathrm{cov}_m\{(\hat{\theta}_I - \hat{\theta}), (\hat{\theta} - \theta) \mid s, s_r\}
= 2 \sum_{k \in s_r} W_{\phi k}^{(+)} (w_k - 1) d_k \sigma_k^2
- 2 \sum_{k \in s_m} w_k (w_k - 1) d_k \sigma_k^2.
\tag{5.12}
$$

If the model bias $B_m$ is negligible, an mpq-unbiased
estimator $\hat{V}_{MIX}$ of the mixed component $V_{MIX}$ is obtained
by replacing $\sigma_k^2$ in (5.12) by an m-unbiased (and
m-consistent) estimator $\hat{\sigma}_k^2$. Note that the mixed component is
not necessarily negligible (Brick, Kalton and Kim 2004)
and, moreover, it has been found to often be negative in
practice.
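The plug-in version of (5.12) can be sketched in the same way as the nonresponse variance. The function name `mixed_component` and the toy inputs below are assumptions for illustration; the weights correspond to mean imputation of one nonrespondent from three respondents with $\omega_l^{(2)} = w_l$.

```python
import numpy as np

def mixed_component(W_plus, w, d, resp, sigma2_hat):
    """Plug-in version of (5.12).

    2 * sum_{s_r} W_k (w_k - 1) d_k sigma_k^2 - 2 * sum_{s_m} w_k (w_k - 1) d_k sigma_k^2
    """
    term_r = np.sum((W_plus * (w - 1.0) * d * sigma2_hat)[resp])
    term_m = np.sum((w * (w - 1.0) * d * sigma2_hat)[~resp])
    return 2.0 * term_r - 2.0 * term_m

w = np.full(4, 2.0)
d = np.ones(4)
resp = np.array([True, True, True, False])
W_plus = np.array([2.0 / 3.0, 2.0 / 3.0, 2.0 / 3.0, 0.0])
sigma2_hat = np.array([4.0, 9.0, 16.0, 25.0])   # assumed model variance estimates

v_mix = mixed_component(W_plus, w, d, resp, sigma2_hat)
```

Here the estimate is negative because the nonrespondent has the largest assumed model variance, which is consistent with the remark that the mixed component is often negative.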

If the model bias $B_m$ is not negligible, it may not be
possible to easily estimate the second component on the
right-hand side of (5.11). The reason is that
$E_m\{(\hat{\theta} - \theta) \mid s, s_r\}$ involves knowing $\mathbf{x}_k^{obs}$ as well as the domain indicator
variable d for the nonsampled portion of the population; this
information may not be available. This problem can be
bypassed by changing the inferential framework. The full
multivariate distribution between y, x and d can be modeled
instead of conditioning on d and $\mathbf{x}^{obs}$. We did not
implement this idea in SEVANI because it leads to a more
complex modeling task and makes it difficult to obtain a
general variance expression that is easy to implement.
Ignoring the second component on the right-hand side of
(5.11) should not be of great concern in practice when the
model bias is not too large. In section 5.4, we provide a
diagnostic that can be helpful for determining whether the
model bias is important or not.
The mixed component can also be written as

$$
\begin{aligned}
V_{MIX} &= 2 E_{pq} E_m\{(\hat{\theta}_I - \hat{\theta})(\hat{\theta} - \theta) \mid s, s_r\} \\
&= 2 E_{pq}[\mathrm{cov}_m\{(\hat{\theta}_I - \hat{\theta}), (\hat{\theta} - \theta) \mid s, s_r\}]
+ 2 E_p[E_q(B_m \mid s) \, E_m\{(\hat{\theta} - \theta) \mid s\}].
\end{aligned}
$$

Expression (5.12) can therefore be used to obtain an
estimator of $V_{MIX}$ provided that $E_q(B_m \mid s)$ is negligible.
This is a weaker assumption than requiring $B_m$ to be
negligible since this assumption is satisfied when either $B_m$
or $E_q(\hat{\theta}_I - \hat{\theta} \mid s)$ is negligible. For instance, in our
example, $B_m$ may not be negligible but, if $d_k = 1$ and
$\omega_k^{(1)} = \omega_k^{(2)} = w_k$, $E_q(\hat{\theta}_I - \hat{\theta} \mid s) \approx 0$ under uniform
nonresponse (see Sitter and Rao 1997).


5.4 Estimation of the overall MSE/variance 


The overall MSE, or overall variance if the overall bias is
negligible,

$$
V_{TOT} = E_{mpq}(\hat{\theta}_I - \theta)^2 = V_{SAM} + V_{NR} + V_{MIX},
$$

can be estimated by $\hat{V}_{TOT} = \hat{V}_{SAM}^{C} + \hat{V}_{NR} + \hat{V}_{MIX}$ if the
model bias, $B_m$, is negligible. The nonresponse component
estimator is $\hat{V}_{NR} + \hat{V}_{MIX}$. From a user's perspective, the
estimator $\hat{V}_{TOT}$ is of greater interest than its individual
components. A user may nevertheless be interested in the
estimator of the sampling variance, $\hat{V}_{SAM}^{C}$, or the ratio
$\hat{V}_{SAM}^{C} / \hat{V}_{TOT}$. The latter estimates the contribution of the
sampling variance to the overall variance.

As pointed out in section 5.2, if the model bias is not
negligible, the nonresponse variance can be estimated by
$\hat{V}_{NR} + \hat{B}_m^2$ instead of $\hat{V}_{NR}$. This leads to the overall MSE
estimator $\hat{V}_{TOT,ADJ} = \hat{V}_{SAM}^{C} + \hat{V}_{NR} + \hat{B}_m^2 + \hat{V}_{MIX} = \hat{V}_{TOT} + \hat{B}_m^2$.

A statistic that can be useful as a diagnostic to determine
the magnitude of the model bias is either $|\hat{B}_m| / \sqrt{\hat{V}_{TOT}}$ or
$|\hat{B}_m| / \sqrt{\hat{V}_{TOT,ADJ}}$. A large value of either of these two
statistics may be an indication that the model bias is not negligible
and that the composite imputation procedure should be
questioned. The advantage of $|\hat{B}_m| / \sqrt{\hat{V}_{TOT,ADJ}}$ over
$|\hat{B}_m| / \sqrt{\hat{V}_{TOT}}$ is that it is bounded; i.e.,
$0 \le |\hat{B}_m| / \sqrt{\hat{V}_{TOT,ADJ}} \le 1$ whenever $\hat{V}_{TOT} \ge 0$.
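The assembly of the components and the two bias diagnostics can be sketched as follows. The function names and the numerical inputs are hypothetical; the component estimates would normally come from the formulas of sections 5.1 to 5.3.

```python
import math

def overall_mse(v_sam, v_nr, v_mix, b_hat=0.0):
    """Return (V_TOT, V_TOT_ADJ), where V_TOT_ADJ = V_TOT + B_hat^2."""
    v_tot = v_sam + v_nr + v_mix
    return v_tot, v_tot + b_hat**2

def bias_diagnostics(v_tot, v_adj, b_hat):
    """Return |B| / sqrt(V_TOT) and |B| / sqrt(V_TOT_ADJ)."""
    return abs(b_hat) / math.sqrt(v_tot), abs(b_hat) / math.sqrt(v_adj)

# Assumed component estimates, with a negative mixed component.
v_tot, v_adj = overall_mse(v_sam=900.0, v_nr=144.0, v_mix=-44.0, b_hat=30.0)
diag_unadjusted, diag_adjusted = bias_diagnostics(v_tot, v_adj, b_hat=30.0)
sampling_share = 900.0 / v_tot   # estimated contribution of the sampling variance
```

The adjusted diagnostic cannot exceed 1 because its denominator contains the squared bias, which makes it easier to interpret than the unadjusted one.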


5.5 Random regression imputation 


A random regression residual $e_k$ is sometimes added to
the regression imputed value $y_k^{*}$ to preserve the natural
variability of the y-variable. We suggest that the random
residuals $e_k$ be generated independently with
$E_{*}(e_k \mid s, s_r) = 0$ and $\mathrm{var}_{*}(e_k \mid s, s_r) = \hat{\sigma}_k^2$, where the subscript *
indicates that the expectation and variance are taken with
respect to the random imputation mechanism. This leads to
the imputed value $y_k^{**} = y_k^{*} + \gamma_k e_k$, with $\gamma_k = 1$ if unit k
has been imputed with a random residual added and $\gamma_k = 0$
otherwise. The imputed estimator (2.1) with $y_k^{*}$ replaced by
$y_k^{**}$ is given by $\hat{\theta}_I^{*} = \hat{\theta}_I + \sum_{k \in s_m} w_k d_k \gamma_k e_k$. Since
$E_{*}(e_k \mid s, s_r) = 0$, adding a random residual does not
introduce any bias in the imputed estimator. The overall
MSE of $\hat{\theta}_I^{*}$ can be expressed as

$$
E_{mpq*}(\hat{\theta}_I^{*} - \theta)^2 = E_{mpq}(\hat{\theta}_I - \theta)^2 + E_{mpq} \mathrm{var}_{*}(\hat{\theta}_I^{*} \mid s, s_r).
\tag{5.13}
$$

The first term on the right-hand side of (5.13) is
estimated as in section 5.4. The second term is estimated by

$$
\mathrm{var}_{*}(\hat{\theta}_I^{*} \mid s, s_r) = \sum_{k \in s_m} w_k^2 d_k \gamma_k \hat{\sigma}_k^2.
\tag{5.14}
$$
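A short Python sketch of random regression imputation and of the additional variance term (5.14) follows. The text only specifies the mean and variance of the residuals, so the normal distribution used here is an assumption, as are the seed and the toy values of $y_k^{*}$ and $\hat{\sigma}_k^2$.

```python
import numpy as np

rng = np.random.default_rng(12345)   # arbitrary seed, for reproducibility only

w = np.full(5, 4.0)
d = np.ones(5)
resp = np.array([True, True, True, False, False])
y_star = np.array([10.0, 20.0, 30.0, 16.0, 20.0])     # observed or imputed values
sigma2_hat = np.array([8.0, 16.0, 25.0, 12.8, 25.0])  # assumed model variances
gamma = np.array([0.0, 0.0, 0.0, 1.0, 0.0])           # residual added for unit 3 only

# Residuals with mean 0 and variance sigma2_hat (normality is an assumption).
e = rng.normal(0.0, np.sqrt(sigma2_hat))
y_2star = y_star + np.where(resp, 0.0, gamma * e)

# Additional variance due to the random residuals, as in (5.14).
v_star = np.sum((w**2 * d * gamma * sigma2_hat)[~resp])
```

Only units with $\gamma_k = 1$ contribute to the extra term, so deterministic imputations within the same composite strategy add nothing to it.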


6. Simulation study 


We conducted a Monte-Carlo simulation study to assess 
the methodology described in section 5. A bivariate popu- 
lation of N = 400 units was generated that contains an 
auxiliary variable x and a variable of interest y. For each 
population unit, the auxiliary variable was generated 
according to a gamma distribution with mean 48 and 
variance 768. The variable of interest y was generated 
conditionally on x from a gamma distribution with mean 
1.5x and variance 16x. Half of the population was 
randomly assigned a missing value to x. As no domain of
interest was generated, $\theta$ is the overall population total of
variable y.
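The gamma distributions above are specified through their mean and variance, whereas most software uses a shape and scale parameterization (shape = mean²/variance, scale = variance/mean). The sketch below generates a population of this kind with NumPy; the seed and the helper name `gamma_shape_scale` are assumptions, so the values will not match the population used in the paper.

```python
import numpy as np

def gamma_shape_scale(mean, variance):
    """Convert (mean, variance) of a gamma distribution to (shape, scale)."""
    return mean**2 / variance, variance / mean

rng = np.random.default_rng(2011)   # arbitrary seed
N = 400

# Auxiliary variable: mean 48, variance 768, i.e. shape 3 and scale 16.
shape_x, scale_x = gamma_shape_scale(48.0, 768.0)
x = rng.gamma(shape_x, scale_x, size=N)

# Variable of interest given x: mean 1.5x, variance 16x.
shape_y, scale_y = gamma_shape_scale(1.5 * x, 16.0 * x)
y = rng.gamma(shape_y, scale_y)

# Half of the population is randomly assigned a missing x-value.
x_missing = np.zeros(N, dtype=bool)
x_missing[rng.choice(N, size=N // 2, replace=False)] = True

theta = y.sum()   # population total of y (no domain, so d_k = 1)
```

Under this model the conditional variance of y is proportional to x, which matches the ratio model $\sigma_k^2 = \sigma_1^2 x_{1k}$ of section 4.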

Ten thousand samples were selected from this population 
using simple random sampling without replacement. We 
considered two sample sizes: n = 100 and n = 250. For 
each sample, nonresponse to variable y was generated 
independently from one unit to another with a nonresponse 
probability of 0.3. We used the same imputation strategy as
in the example in section 2 with $\omega_l^{(1)} = 1$, for $l \in s_r^{(1)}$, and
$\omega_l^{(2)} = 1$, for $l \in s_r$. Nonrespondents to variable y with an
observed x-value were imputed by ratio imputation while 
those with a missing x-value were imputed by mean 
imputation. 

The population y-values were kept fixed throughout the 
replications of the simulation experiment; each replication 
consisted of selecting a sample and then generating 
nonresponse to variable y. If we had strictly followed the 
theoretical development in section 5, we would have 
generated new y-values at each replication according to the 
imputation model. However, it is more common in the 
literature to fix the population y-values when conducting a 
simulation experiment. For instance, our simulation set-up is 
essentially the same as the one discussed in Rancourt, Lee 
and Sarndal (1993), who also considered composite 
imputation. 
We computed the Monte-Carlo sampling variance and
overall MSE as $V_{SAM}^{MC} = \sum_{r=1}^{R} (\hat{\theta}_r - \theta)^2 / R$ and
$V_{TOT}^{MC} = \sum_{r=1}^{R} (\hat{\theta}_{I,r} - \theta)^2 / R$ respectively, where the subscript r
indicates that estimates are computed using the r-th replicate
and R = 10,000. The Monte-Carlo relative bias of any
estimator of $V_{SAM}$, say $\hat{V}_{SAM}$, is computed as
$RB(\hat{V}_{SAM}) = \sum_{r=1}^{R} (\hat{V}_{SAM,r} - V_{SAM}^{MC}) / (V_{SAM}^{MC} R)$. Similarly, we computed
the Monte-Carlo relative bias of an estimator of $V_{TOT}$,
denoted as $RB(\hat{V}_{TOT})$, and the Monte-Carlo relative bias of
an estimator of $V_{SAM} / V_{TOT}$, denoted as $RB(\hat{V}_{SAM} / \hat{V}_{TOT})$.
Finally, we computed the Monte-Carlo coverage rates of
confidence intervals for $\theta$ with a 95% confidence level
assuming that $\hat{\theta}_I$ is normally distributed.
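These Monte-Carlo summaries are simple averages over replicates. A minimal Python sketch is shown below; the function names and the tiny input arrays are hypothetical and only illustrate the arithmetic.

```python
import numpy as np

def relative_bias(estimates, target):
    """Monte-Carlo relative bias: mean(estimate_r - target) / target."""
    estimates = np.asarray(estimates, dtype=float)
    return float(np.mean(estimates - target) / target)

def coverage_rate(theta_hat, v_hat, theta, z=1.959964):
    """Share of normal-theory intervals theta_hat +/- z * sqrt(v_hat) covering theta."""
    theta_hat = np.asarray(theta_hat, dtype=float)
    half_width = z * np.sqrt(np.asarray(v_hat, dtype=float))
    return float(np.mean(np.abs(theta_hat - theta) <= half_width))

# Four replicate variance estimates compared with a Monte-Carlo target of 100.
rb = relative_bias([90.0, 110.0, 100.0, 120.0], target=100.0)

# Three replicate point estimates with estimated variance 4 and true total 100.
cov = coverage_rate([98.0, 103.0, 110.0], [4.0, 4.0, 4.0], theta=100.0)
```

In a full study the first argument of each function would hold one value per replicate, 10,000 values in the set-up described above.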

The results of our simulation study are given in table 2.
In the columns labeled SEVANI, the sampling variance,
$V_{SAM}$, and the overall MSE, $V_{TOT}$, are estimated for each
sample by $\hat{V}_{SAM}^{C}$ and $\hat{V}_{TOT,ADJ}$ respectively (see section
5.4). We have also obtained results by replacing $\hat{V}_{TOT,ADJ}$
by $\hat{V}_{TOT}$. We do not report these additional results in table 2
as they were quite close to those obtained with $\hat{V}_{TOT,ADJ}$.
This suggests that the model bias $B_m$ is not important in this
case. In the columns labeled Naive, both the sampling
variance and the overall MSE are estimated by $\hat{V}_{ORD}$ (see
section 5.1).


Table 2
Results of the simulation study

                                  n = 100               n = 250
                               SEVANI     Naive      SEVANI     Naive
$RB(\hat{V}_{SAM})$             2.82%    -17.59%      3.02%    -17.68%
$RB(\hat{V}_{SAM}/\hat{V}_{TOT})$  8.30%       -       5.84%       -
$RB(\hat{V}_{TOT})$            -5.07%    -40.68%     -2.66%    -52.89%
Coverage rate                  93.38%     86.20%     94.42%     81.80%


These results show that the methodology described in 
section 5 and implemented in SEVANI is better than the 
naive variance estimator for the estimation of the compo- 
nents of variance and the construction of confidence 
intervals. The use of SEVANI leads to small Monte-Carlo 
relative biases and coverage rates close to the targeted 
nominal rate (95%). Our methodology is also useful for
users who would like to estimate the contribution of the
sampling variance to the overall MSE; i.e., $V_{SAM} / V_{TOT}$.
Note that $V_{SAM}^{MC} / V_{TOT}^{MC}$ is 71.98% for n = 100 and 57.23%
for n = 250. Since $V_{SAM}^{MC} / V_{TOT}^{MC}$ is not close to 100% even
for n = 100, the effects of nonresponse and imputation
cannot be systematically ignored when estimating the
overall MSE.
7. The reverse approach


Shao and Steel (1999) proposed a reverse approach to
variance estimation developed to deal with composite
imputation. They assumed that the overall bias is negligible
and suggested the following decomposition of the overall
variance:

$$
E_{mpq}(\hat{\theta}_I - \theta)^2 = E_{mq} \mathrm{var}_p(\hat{\theta}_I \mid U_r)
+ E_{mq}\{E_p(\hat{\theta}_I \mid U_r) - \theta\}^2,
\tag{7.1}
$$

where $U_r$ is a conceptual population of respondents. The
inner expectation and variance in the right side of (7.1) are
taken with respect to the sampling design. Unfortunately,
the imputed estimator $\hat{\theta}_I$ is generally not linear with respect
to the sampling design even though it is linear with respect
to the observed y-values. Therefore, the imputed estimator
$\hat{\theta}_I$ is typically linearized (e.g., Shao and Steel 1999; Kim
and Rao 2009). More explicitly, the quantities $\phi_{0k}^{(j)}$ and
$\phi_{lk}^{(j)}$ often depend on the sample in a nonlinear way; e.g.,
this is true with linear regression imputation (see the
example at the end of section 3) and donor imputation. It is
not always straightforward to account for the sampling
variability of $\phi_{0k}^{(j)}$ and $\phi_{lk}^{(j)}$ when using (7.1). For example,
there is no literature on the use of the reverse approach to
estimate the variance under nearest-neighbour imputation.
Moreover, since each composite imputation strategy yields
its own linearized imputed estimator, it is not an easy task to
implement this methodology in a generalized software
package.

Using our approach, the inner expectations in the
expressions for the nonresponse variance,

$$
V_{NR} = E_{pq} E_m\{(\hat{\theta}_I - \hat{\theta})^2 \mid s, s_r\},
$$

and the mixed component,

$$
V_{MIX} = 2 E_{pq} E_m\{(\hat{\theta}_I - \hat{\theta})(\hat{\theta} - \theta) \mid s, s_r\},
$$

are taken with respect to the imputation model (conditionally
on s and $s_r$). The imputed estimator is linear and
the derivations are straightforward because the quantities
$\phi_{0k}^{(j)}$ and $\phi_{lk}^{(j)}$ are constructed without using the y-values.
The estimation of the sampling variance, $V_{SAM} = E_m \mathrm{var}_p(\hat{\theta})$,
does not involve these two quantities (see equation
5.5); thus, their possible non-linearity with respect to the
sampling design does not cause any difficulty. This implies
that nearest-neighbour imputation can be easily handled
with our approach (see Beaumont and Bocci 2009).

It is for all the above reasons that we believe that the 
reverse approach might be more cumbersome to implement 
in a generalized software package than our approach. This 
does not mean that the reverse approach is not useful. Indeed,
both approaches lead to identical variance estimators
when a census is conducted. Beaumont, Haziza and Bocci
(2011) showed that they also lead to identical variance
estimators under auxiliary value imputation (because Oe 
and @\/’ do not depend on s and s,). Both approaches 
depend on the correct specification of the imputation model 
and no approach is expected to systematically outperform 
the other. 

The reverse approach may have a practical advantage 
over our approach when the sampling fraction is negligible. 
In such a case, Shao and Steel (1999) showed that the second
component on the right side of (7.1) can be neglected. The
first component is estimated by finding a design-based
estimator of $\operatorname{var}_p(\hat{\theta}_I \mid U_r)$. If a replication variance estimation
technique (e.g., the jackknife or the bootstrap) is
chosen for the estimation of $\operatorname{var}_p(\hat{\theta}_I \mid U_r)$, the whole
approach becomes quite attractive and practical. Also, it
does not depend on the validity of the imputation model; in 
particular, the correct specification of the model variance 
o;. The jackknife variance estimators of Rancourt, Lee and 
Sarndal (1993) and Sitter and Rao (1997) can be justified by 
this approach. 


8. Conclusion 


Our methodology for composite imputation has been 
implemented in version 2 of SEVANI because of its ease of 
implementation and generality. It works for most imputation 
methods used in practice, as most imputation methods are 
linear. The variance computations are the same for every 
composite imputation strategy once the quantities W\?, 
Wi’, i, and 6; have been computed. This eases the 
development of a generalized system. 

Although we have focused on the estimation of a domain 
total using the Horvitz-Thompson estimator, SEVANI can 
also deal with domain means and calibration estimators. 
Parametric and nonparametric methods of estimating the model mean
and the model variance are also available. Greater detail can be found in the
Methodology Guide of SEVANI (Beaumont, Bissonnette 
and Bocci 2010) available upon request from the authors. 


Alternative demographic sample
designs being explored at the U.S. Census Bureau 


Patrick E. Flanagan and Ruth Ann Killion


1. Introduction 


The United States (U.S.) Census Bureau Demographic 
Survey Sample Redesign Program, among other things, is 
responsible for research into improving the designs of U.S. 
demographic surveys, particularly focused on the design of 
survey sampling. Historically, the research into improving 
sample design has been restricted to the “mainstream” 
methods like basic stratification, multi-stage designs, sys- 
tematic sampling, probability-proportional-to-size sampling, 
clustering, and simple random sampling. Over the past thirty 
years or more, we have increasingly faced reduced response 
rates and higher costs coupled with an increasing demand 
for more data on all types of populations. More recently, 
dramatic increases in computing power and availability of 
auxiliary data from administrative records have indicated 
that we may have more options than we did when we 
established our current methodology. Thus, we began an 
initiative to explore alternative sampling methods. 


2. History of innovation in demographic 
survey sampling at the U.S. Census Bureau 


The U.S. Census Bureau was created by the Permanent 
Census Act of 1902. Up until the late 1930s, the U.S.
Census Bureau’s demographic work was mostly focused on 
the logistics of running each decennial census and a myriad 
of special censuses. After the 1930 decennial census, the 
Census Bureau began research into sampling using the 
census data (Stephan 1948). 

Then, in 1937, the Census Bureau took its first major step
into survey sampling with the 1937 Enumerative
Check Census of Unemployment, which used a cluster 
sample of counties in support of a register census of the 
unemployed (Dedrick 1938). About the same time, the Cen- 
sus Bureau brought in sampling experts (e.g., W. Edwards 
Deming and Frederick Stephan) in its decennial census
expansion to assist in designing a sample survey in 
conjunction with the 1940 Decennial Census using a five 
percent systematic sample (Stephan, Deming and Hansen 
1940). In 1942, the Sample Survey of Unemployment was 
moved from the Works Progress Administration to the Cen- 
sus Bureau. This survey was already a three-stage sample 
with county primary sampling units (PSUs), systematic 
sampling of blocks, and sampling listed housing units in
stage three (Frankel and Stock 1942). After its transfer to the
Census Bureau (and a name change to the Monthly Report 
on the Labor Force (MRLF)), it was extensively redesigned 
in 1943, dramatically improving its efficiency using larger 
primary sampling units (PSUs) and probability-proportionate-to-size
selection (Duncan and Shelton 1978).
Later the survey was changed to improve month-to-month 
and year-to-year comparisons using a more complex 
overlapping sample approach in which a given household 
remains in sample for four months, is out of the survey for 
eight months and then is back into the sample for four 
months. Its name was also changed in 1947 to the Current 
Population Survey (CPS). Still, the basic sampling concept 
remained multi-stage sample design with county or county 
group PSUs. It remains that way to the present, though there are
vast differences in the within-PSU sampling methods (U.S.
Bureau of Labor Statistics and U.S. Census Bureau 2006). 
Over the last 60 years, the U.S. Census Bureau has designed 
many additional demographic surveys. Some of those sur- 
veys use the same two-stage design idea used in the CPS, 
like the Consumer Expenditures Surveys, the Survey of 
Income and Program Participation, the National Crime 
Victimization Survey, and the National Health Interview 
Survey. Some others are two-stage with selection of a list 
source followed by sampling from the lists like the Schools 
and Staffing Survey, the Private School Survey, and the 
Survey of Inmates of Local Jails. Still others are stratified
samples from a sampled frame, such as the National Survey 
of College Graduates that has sampled from the Decennial 
Census Long Form, and the American Time Use Survey 
that samples from the CPS. In the early 1990s, the U.S.
Census Bureau began developing
continuous measurement as a possible replacement for the
Decennial Census Long Form. Those efforts have since
evolved into the current American Community Survey,
which, starting in 2010, will provide continual mid-decade
estimates down to the block group level. The Census
Bureau’s continuing goal of improving our sampling methodology
leads us to explore alternative sample designs.


3. Alternative survey sample 
design seminar series 


The exploration into alternative methods of sampling 
began with an initial seminar series that was held at the U.S. 
Census Bureau. It consisted of three seminar presentations
of such methods covering the statistical bases of the 
methods and their limitations, especially when applied to the 
types of demographic surveys conducted by the U.S. Census 
Bureau. Each presentation also included discussant com- 
ments by Professor Jean Opsomer from Colorado State 
University. Three articles were then developed providing 
greater detail on each topic and a final discussant article 
covering the three subjects. 


On 26 September 2007, Professor Steven K. Thompson
from Simon Fraser University gave a presentation on
his research into network sampling, spatial sampling,
and adaptive sampling.

On 9 January 2008, Professor Sharon Lohr from
Arizona State University gave a presentation on her
research into sampling using overlapping frames.

On 4 June 2008, Professor Yves Tillé from the University
of Neuchâtel gave a presentation on his research into
balanced sampling.


The articles resulting from this project that follow are: 


“Adaptive network and spatial sampling,” by Steven 
Thompson; 


“Alternative survey sample designs: Sampling with multiple 
overlapping frames,” by Sharon Lohr; 


“Ten years of balanced sampling with the cube method: An 
appraisal,” by Yves Tillé; and 


“Innovations in survey sampling design: Discussion of three 
contributions presented at the U.S. Census Bureau,” by Jean 
Opsomer. 


4. Next steps 


Following these three presentations, it was decided to
conduct further research into these methods and their
application either to existing U.S. Census Bureau demographic
surveys or to potential new surveys. There is
already an urgent need to apply multiple overlapping
frames methods to the National Survey of College
Graduates to deal with an old-cohort/new-cohort problem,
and state hunting and fishing license registries are
a possible second frame for the Fishing, Hunting, and
Wildlife-Associated Recreation survey. We have plans to
look at balanced sampling, particularly for selecting
geographic primary sampling units. Lastly, the methods of
adaptive sampling have the potential to let us accept
surveys that we traditionally have not taken on, as well as
to provide a lower cost alternative for surveys that meet
certain criteria.


5. Summary 


This exploration into these three areas of alternative 
sample designs is just the beginning of our seminar series 
and of our intentions to explore methods to improve our 
demographic survey sample design methods. Future antic- 
ipated subjects include alternative listing methods, Kish’s 
half-open interval approach to growth updates and coverage 
improvement, responsive survey designs, rejective sampling 
procedures, and model-assisted sampling. 


Acknowledgements 


This report is released to inform interested parties of 
research and to encourage discussion. The views expressed 
on statistical, methodological, technical, or operational 
issues are those of the authors and not necessarily those of 
the U.S. Census Bureau. 


References 


Dedrick, C.L. (1938). Census of unemployment 1937: Principle 
findings of the enumerative check census. U.S. Bureau of the 
Census. 


Duncan, J.W., and Shelton, W.C. (1978). Revolution in United States 
Government Statistics 1926-1976. U.S. Department of
Commerce. 


Frankel, L.R., and Stock, J.S. (1942). On the sample survey of 
unemployment. Journal of the American Statistical Association, 
37, 77-80. 


Stephan, F.F., Deming, W.E. and Hansen, M.H. (1940). The sampling 
procedure of the 1940 population census. Journal of the American 
Statistical Association, 35, 615-630. 


Stephan, F.F. (1948). History of the uses of modern sampling 
procedures. Journal of the American Statistical Association, 43, 
12-39.


U.S. Bureau of Labor Statistics and U.S. Census Bureau (2006). 
Design and Methodology: Current Population Survey. 


Survey Methodology, December 2011 
Vol. 37, No. 2, pp. 183-196 
Statistics Canada, Catalogue No. 12-001-x 




Adaptive network and spatial sampling 


Steve Thompson


Abstract 


This paper describes recent developments in adaptive sampling strategies and introduces new variations on those strategies. 
Recent developments described include targeted random walk designs and adaptive web sampling. These designs are
particularly suited for sampling in networks; for example, for finding a sample of people from a hidden human population 
by following social links from sample individuals to find additional members of the hidden population to add to the sample. 
Each of these designs can also be translated into spatial settings to produce flexible new spatial adaptive strategies for 
sampling unevenly distributed populations. Variations on these sampling strategies include versions in which the network or 
spatial links have unequal weights and are followed with unequal probabilities. 


Key Words: Network sampling; Snowball sampling; Random walk; Markov chain; Adaptive web sampling. 


1. Introduction 


An adaptive sampling design is a procedure for selecting 
the sample in which the probabilities of selecting the set of 
sample units from the population depend on values of the 
variable of interest observed during the survey. In a spatial 
setting, adaptive sampling is exemplified by a survey in
which, whenever a unit in the sample is observed to have an 
unusually high or otherwise interesting value of the variable 
of interest, nearby units may be added to the sample. In a 
network setting such as a socially networked human sub- 
population, a link-tracing design may be used to adaptively 
follow social links from sample individuals to locate and 
add additional members of the subpopulation to the sample. 

In spatial settings the development of adaptive designs 
has been motivated by such problems as estimating the 
abundance of rare, clustered plant and animal species, as- 
sessment of unevenly distributed environmental pollutants, 
and surveys of geographically clustered subpopulations of 
people. In network settings the development of adaptive 
network sampling designs has been motivated by problems 
in sampling people with rare diseases, sampling hidden 
populations such as those at high risk for HIV/AIDS or 
other epidemics, and sampling through computer and 
communications networks. 

Zacks (1969) and Basu (1969) recognized that in most 
cases the optimal sampling would in principle be an 
adaptive one. With a Bayes model for the population, at any 
step part way through a sampling procedure, one can do as 
well or better than a conventional design by selecting the 
remaining sample to give the lowest mean square error 
conditional on the observed sample values so far. The 
overall mean square error is the expected value of the 
conditional mean square error. The underlying mathematical 
principle is that the integral of the minimum of a set of 
functions is smaller than, or no larger than, the minimum of the
integrals. Results on optimal adaptive strategies are
described and extended in Thompson and Seber (1996) and 
exemplified in Chao and Thompson (2001). 
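
The inequality behind this argument can be checked on a toy case. The sketch below is illustrative only: the conditional mean square errors, the two candidate designs, and the three equally likely interim outcomes are all made-up assumptions, not values from any survey.

```python
# Hypothetical conditional MSEs: rows are equally likely interim outcomes,
# columns are two candidate designs for selecting the remaining sample.
cond_mse = [
    [4.0, 1.0],
    [2.0, 3.0],
    [5.0, 6.0],
]

def mean(values):
    values = list(values)
    return sum(values) / len(values)

# Adaptive: pick the best design separately for each interim outcome,
# then average (the "integral of the minimum").
adaptive_mse = mean(min(row) for row in cond_mse)

# Conventional: commit to a single design before any data are seen
# (the "minimum of the integrals").
fixed_mse = min(mean(row[d] for row in cond_mse) for d in range(2))

print(adaptive_mse, fixed_mse)
```

In this toy case the adaptive choice averages 8/3 while the best fixed design averages 10/3, in line with the principle stated above.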

In spite of the early theoretical results and motivation 
from field surveys, the importance of adaptive designs was 
not widely recognized for several decades in either theory or 
practice. The practical importance of adaptive sampling 
strategies became evident as statistical thinking was brought 
to bear on problems in natural resource management and 
environmental protection. The development of adaptive 
link-tracing designs for reaching hidden human populations 
has attained strategic importance for such problems as 
understanding and alleviating the global HIV epidemic. In 
addition, new interest in adaptive sampling methods is being 
spurred by problems of expense and effort in social surveys 
of all types. 

Adaptive designs such as those described in this paper 
often serve as high yield designs in that sample values of 
variables of interest tend to be higher on average than 
population means of the same variables. Although this is 
often a desired characteristic itself in studies of rare popula- 
tions, simple sample data summaries such as sample means 
and sample proportions are generally not good estimates of 
population means or proportions. Instead, effective design-based
and model-based estimators of population quantities
have been developed for use with adaptive designs. 

With design-based estimators, properties such as un- 
biasedness or consistency depend solely on the way the 
sample is selected and not on assumptions about what the 
population may be like. Model-based estimators such as 
maximum likelihood or Bayes estimators on the other hand 
require use of a statistical model, usually involving un- 
known parameter values, describing the population of in- 
terest. Design-based estimators for adaptive designs are 
described in Thompson and Seber (1996), Thompson 
(2006a, b), and earlier papers. 

Basic results for model-based approaches to inference
with adaptive designs were given in Thompson and Seber
(1996), which showed that likelihood-based methods such
as maximum likelihood and Bayes inference would be more
effective than other model-based approaches (for example,
the linear unbiased prediction approach) with adaptive
designs. Maximum likelihood estimation and the likelihood-based
approach more generally with link-tracing designs
were described in Thompson and Frank (2000). Bayes estimation
with link-tracing designs was used in Chow and
Thompson (2003). A method combining model-based and design-based
features was used in Felix-Medina and Thompson
(2004). Bayes estimation using Markov chain Monte Carlo
(MCMC) with adaptive web sampling designs is described 
in Kwanisai (2005, 2006). 


2. Adaptive sampling in network settings 


A population has network structure if there are links or 
relationships between any of the units in the population. 
Mathematically, such a population is described as a graph, 
consisting of a set of nodes and a set of edges or arcs 
between nodes. More generally, each relationship between a 
pair of nodes may have a weight denoting the strength or
value associated with the relationship.

Human populations have an inherent network structure 
arising from social relationships. As will be noted later, 
spatial relationships also give a network structure to many 
populations. Network populations also arise in computing 
networks, communications, gene regulation and metabolic 
networks. 

Network structure in populations is important for two 
reasons. First, the network relationships may be of interest 
in themselves to researchers. For example, with contagious 
disease epidemics it is important to know the nature and 
pattern of the social contacts through which the disease 
spreads. Second, the network structure can be used to help 
in obtaining a sample from a population that is otherwise 
difficult to sample. For example, in the study of hidden 
populations at risk for HIV/AIDS, including drug injectors, 
commercial sex workers, and others, often the only way
to obtain a sample large enough for the study is
to follow social links from initial sample individuals to find 
more members of the hidden population. 

Most network sampling designs which follow links are 
inherently adaptive in that the link values used in the 
selection are variables of interest that are generally not 
known prior to the survey. Further, in some studies it may 
be of interest to follow links with higher probability from 
sample individuals with high values of variables associated 
with behavioral risk. 

A class of designs called multiplicity sampling or simply
network sampling was introduced by Birnbaum and Sirken
(1965), along with design-unbiased estimators of population 
quantities. The approach was developed further by Sirken 
(1970, 1972a, b) and others and is described in Thompson 
(2002). In these designs the units on which observations are 
made are obtained by first selecting “selection units”, to 
which the observational units are linked. Motivation for 
these strategies came from problems in public health in 
which commonly used estimates were found to be biased 
because of the unequal numbers of such links. The simplest 
of the unbiased estimators in terms of computations was the 
“multiplicity estimator” which simply divided the observed 
value of a variable of interest measured on an observational 
unit by its “multiplicity”, the number of selection units to 
which it is linked. Horvitz-Thompson estimators for the 
strategy were also introduced. The following decades saw 
many variations on this strategy published in the statistics 
and substantive literatures. 
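
The multiplicity estimator can be sketched as follows. This is a minimal illustration, assuming a simple random sample of selection units without replacement and the usual N/n expansion factor; the household and patient labels, the link structure, and the y-values are hypothetical.

```python
from itertools import combinations

# Hypothetical toy data: four selection units (say, households), each listing
# the observational units (say, patients) linked to it.
links = {"h1": ["a", "b"], "h2": ["b"], "h3": ["b", "c"], "h4": []}
y = {"a": 10.0, "b": 6.0, "c": 4.0}

# Multiplicity = number of selection units to which a unit is linked.
multiplicity = {u: sum(u in linked for linked in links.values()) for u in y}

def multiplicity_estimate(sample, n_units_total):
    """Estimate the total of y from a simple random sample of selection units."""
    weighted = sum(y[u] / multiplicity[u] for h in sample for u in links[h])
    return n_units_total / len(sample) * weighted

# Average the estimator over every possible sample of two selection units.
samples = list(combinations(links, 2))
estimates = [multiplicity_estimate(s, len(links)) for s in samples]
expected_value = sum(estimates) / len(estimates)
print(expected_value, sum(y.values()))
```

Averaging over all possible samples reproduces the true total in this toy population, which is what design-unbiasedness means here. Ignoring the multiplicities would count unit "b" three times.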

In snowball sampling an initial sample of nodes is 
selected by some design such as simple random sampling, 
and every link out is followed to add connected nodes to the 
sample. This process is continued for a specified number of 
steps, or “waves”. More generally, a subsample of links, such as a
fixed number, is followed at each wave. Frank
(1971, 1977a, b, 1978a, b, 1979) framed the problem as one
of sampling in graphs and worked out design-based estimators
for many cases of snowball designs including designs with 
unequal initial selection probabilities and estimators for 
population quantities such as totals and means of variables 
associated with nodes or individuals, as well as of popula- 
tion link quantities such as mean degree, where degree of a 
node is defined as the number of links out (or in) from that 
node. Frank and Snijders (1994) introduced a number of 
design-based and model-based estimators for one wave 
snowball designs motivated by the problem of estimating 
the number of injection drug users in a city. 
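
The snowball selection procedure itself can be sketched in a few lines. The function name, the adjacency-list representation, and the toy network below are assumptions made for illustration; the optional `max_links` argument corresponds to following only a fixed number of links from each node.

```python
import random

def snowball_sample(adj, initial, waves, max_links=None, rng=None):
    """Return the set of nodes reached after following links for `waves` waves.

    adj: dict mapping each node to a list of its out-neighbours.
    max_links: if given, follow at most this many randomly chosen links per node.
    """
    rng = rng or random.Random()
    sample = set(initial)
    frontier = list(initial)
    for _ in range(waves):
        new_nodes = []
        for node in frontier:
            neighbours = adj[node]
            if max_links is not None and len(neighbours) > max_links:
                neighbours = rng.sample(neighbours, max_links)
            for nb in neighbours:
                if nb not in sample:
                    sample.add(nb)
                    new_nodes.append(nb)
        frontier = new_nodes  # only newly added nodes are traced in the next wave
    return sample

# Hypothetical toy network with symmetric links; node 6 is isolated.
adj = {1: [2, 3], 2: [1, 4], 3: [1], 4: [2, 5], 5: [4], 6: []}
one_wave = sorted(snowball_sample(adj, [1], waves=1))
two_waves = sorted(snowball_sample(adj, [1], waves=2))
print(one_wave, two_waves)
```

Starting from node 1, one wave reaches nodes 1 to 3 and two waves reach nodes 1 to 4; the isolated node can only enter through the initial sample.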

In a random walk design an initial node is selected at 
random. From the links out from that node one link is 
selected at random and followed to add the connected node 
to the sample. This process is continued for a specified 
number of waves, with one unit selected at each wave. If the 
sampling is with replacement the design is a Markov chain, 
with the state of the chain at each step being the identity of 
the node selected at that step. Properties of such designs, 
cast as Markov chains in graphs, such as the limiting or 
stationary probabilities were examined in the statistics and 
probability literatures (Lovasz 1993). Random walk designs 
were introduced into the social network literature by 
Klovdahl (1989) with the motivation of reaching into a 
hidden human population farther away from the initial 
sample than possible with the same sample size using a 
snowball design. In the computing science literature, Brin
and Page (1998) used the concept of a stationary distribution 
of a random walk in a graph in developing a search engine 
and web page ranking algorithm, evoking the metaphor of a 
“random surfer” to describe the process of a random walk 
following hyper-links from web page to web page. 
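
For a with-replacement random walk on a connected network with symmetric links, the stationary probabilities are proportional to node degree. The sketch below checks this on a small hypothetical network by pushing the degree-proportional distribution through one step of the walk; the network and node labels are assumptions made for illustration.

```python
# Hypothetical connected network with symmetric links.
adj = {"a": ["b", "c"], "b": ["a", "c", "d"], "c": ["a", "b"], "d": ["b"]}

total_degree = sum(len(nbrs) for nbrs in adj.values())
# Candidate stationary probabilities: proportional to degree.
pi = {node: len(nbrs) / total_degree for node, nbrs in adj.items()}

def one_step(prob):
    """Push a probability distribution over nodes through one random-walk step."""
    out = {node: 0.0 for node in adj}
    for node, nbrs in adj.items():
        for nb in nbrs:
            out[nb] += prob[node] / len(nbrs)  # each link out is equally likely
    return out

after = one_step(pi)
print(pi)
print(after)
```

The distribution is unchanged by a step of the walk, which is the defining property of a stationary distribution.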

Heckathorn (1997, 2002) and Salganik and Heckathorn 
(2004) described a sampling methodology referred to as 
“respondent driven sampling” in which members of a hid- 
den population were motivated to recruit other members of 
the population into the sample using a system of coupons. A 
simple estimator of population totals and means, in which 
each observation is weighted by the reciprocal of that 
person’s degree, was used with these designs based on the 
limiting distribution of a with-replacement random walk in a 
network having symmetric links and a single connected 
component. The coupon-based methodologies developed 
with these designs have proven to be highly effective in 
recruiting samples of substantial size from hidden popula- 
tions in a number of settings. 
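
The effect of weighting by reciprocal degree can be seen in a small simulation. This is a sketch under idealized assumptions: a with-replacement random walk, symmetric links, a single connected component, and a hypothetical four-node network in which the two highest-degree nodes carry the trait of interest.

```python
import random

# Hypothetical connected network with symmetric links; y marks a trait of interest.
adj = {"a": ["b", "c"], "b": ["a", "c", "d"], "c": ["a", "b"], "d": ["b"]}
y = {"a": 1, "b": 1, "c": 0, "d": 0}

rng = random.Random(0)
node = "a"
visited = []
for _ in range(50000):
    node = rng.choice(adj[node])  # with-replacement random walk
    visited.append(node)

true_mean = sum(y.values()) / len(y)
naive_mean = sum(y[v] for v in visited) / len(visited)

# Weight each observation by the reciprocal of that node's degree.
weights = [1.0 / len(adj[v]) for v in visited]
weighted_mean = sum(w * y[v] for w, v in zip(weights, visited)) / sum(weights)
print(true_mean, naive_mean, weighted_mean)
```

The unweighted sample mean settles near 5/8 because the high-degree nodes are visited more often, while the degree-weighted mean settles near the population mean of 1/2.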

The notational setup for sampling in networks follows. 
There is a population of units or nodes labeled $1, 2, \ldots, N$,
with associated variables of interest $y_1, y_2, \ldots, y_N$. Associated
with each pair of nodes $(i, j)$ is a link-indicator or
weight, so that the collection $\{w_{ij};\ i, j = 1, \ldots, N\}$ are variables
of interest associated with pairs of nodes.

In the network context a sample $s$ is a subset $s^{(1)}$ of
nodes and a subset $s^{(2)}$ of the pairs of nodes, that is,
$s = (s^{(1)}, s^{(2)})$. Thus the sample consists of a sample of
nodes, on which node variables y of interest are recorded, 
and a sample of pairs of nodes, for which the values of 
relationship variables w are recorded. 

Figure 1 shows a network-structured population which 
will be used to illustrate some of the network sampling 
designs described in this paper. In terms of a human popula- 
tion with social network structure, the red or dark colored 
nodes could represent individuals with high values of 
variables of interest, for example indicating a risk-related 
behavior such as injecting drug use. The light colored or 
yellow nodes would represent the individuals without the 
high-risk characteristic of interest. The links between indi- 
viduals would represent social relationships such as having 
meals together, drug-using relationships, or sexual contacts. 

Figure 2 shows an initial simple random sample of five 
nodes selected from the network-structured population. A 
one-wave snowball sample selected by following every link 
out from the initial sample is shown in Figure 3, and a two- 
wave snowball sample from the same initial sample is shown 
in Figure 4. Note that with a fixed number of waves, a 
snowball sample can grow very fast. 

Figure 1 A population with network structure


Figure 2 A random sample of nodes


Figure 3 One-wave snowball sample


Figure 4 Two-wave snowball sample


With snowball sampling designs and many other link-tracing
designs, sample data summaries such as a sample
mean or sample proportions are not good estimators of the 
analogous population characteristics. The reason is that 
under the design different units have different probabilities 
of selection, dependent on the population link structure. 
Figure 5 shows the population with the size of each node 
proportional to the probability of selecting that node. Since 
high-risk individuals tend to have more links and hence higher
probabilities of inclusion in the sample, the sample mean 
would tend to overestimate the population mean. In the 
same way, the average degree of such a sample would tend 
to overestimate the mean degree of the population network. 

With the one-wave snowball design in a setting with 
symmetric links the inclusion probabilities for sample nodes 
can be easily calculated as proportional to the node degrees. 
With asymmetric links or with snowball designs of more 
than one wave it is not in general possible to calculate node 
inclusion probabilities from the sample data. Methods for 
calculating design-unbiased estimators of population node 
and link characteristics with such designs are described in 
the section on adaptive web sampling later in this paper. 

Figure 6 shows a random walk sample from this same
network population starting with one randomly selected
unit. Since the population consists of more than a single
connected component, a strict random walk design would be
stuck in whatever component it started in. It is therefore
desirable to provide in the design some small probability at 
each step of selecting the next unit by simple random 
sampling or some other conventional design, or at least 
allowing a random jump whenever a walk is found to be 
stuck in a component. 
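
A random walk with occasional random jumps can be sketched as below. The function name, the jump probability of 0.1, and the two-component toy population are assumptions chosen for illustration; the jump here is a simple random selection from all nodes.

```python
import random

def random_walk_with_jumps(adj, n_steps, jump_prob, rng):
    """Random walk that occasionally restarts at a node chosen uniformly at random."""
    nodes = list(adj)
    current = rng.choice(nodes)
    path = [current]
    for _ in range(n_steps):
        stuck = not adj[current]                # no links out of the current node
        if stuck or rng.random() < jump_prob:
            current = rng.choice(nodes)         # random jump
        else:
            current = rng.choice(adj[current])  # follow one link out
        path.append(current)
    return path

# Hypothetical population with two connected components.
adj = {1: [2], 2: [1, 3], 3: [2], 4: [5], 5: [4]}
strict = random_walk_with_jumps(adj, 1000, 0.0, random.Random(1))
jumpy = random_walk_with_jumps(adj, 1000, 0.1, random.Random(1))
print(sorted(set(strict)), sorted(set(jumpy)))
```

With the jump probability set to zero the walk never leaves the component it starts in; with a small positive jump probability it reaches both components.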

Figure 7 shows the stationary selection probabilities for 
the random walk through the network shown. Although 
these probabilities in this population are not simply 
proportional to node degrees it can be seen that nodes with 
high degree do tend to have high selection probabilities.
Also, since high risk individuals in this population tend to 
have high selection probability under this design, sample 
summaries such as sample mean and sample proportion are 
not unbiased estimators of population means and propor- 
tions. For unbiased estimates the methods of later sections 
of this paper would have to be used. 


Figure 5 One-wave snowball sample selection probabilities


Figure 6 A random walk sample from the same population


Figure 7 Random walk limit selection probabilities


2.1 Targeted random walk designs


One of the early motivations for using random walk 
designs with hidden populations was to penetrate deeper 
into the population, that is, farther from the initial sample 
and thereby obtain a more representative sample of the 
population. When the probabilities of selecting a given 
person by such a method are calculated either step by step or 
in their stationary limit, they are not in general equal but 
depend on the link and degree structure. With the moti- 
vation first to find a method for selecting a sample through a 
network such that the stationary probabilities would be the 
same for each person or node, uniform and targeted random 
walk sampling designs were developed (Thompson 2006a). 
An additional motivation was to find a more flexible and 
adaptable way to sample through a network. 

Since a random walk with replacement through a graph 
or network is a Markov chain, ideas of Markov chain Monte 
Carlo can be applied to produce a different Markov chain 
having desired stationary probabilities. At each step of the 
sampling the state of the chain is the current node added to 
the sample. The stationary probabilities of the chain corre- 
spond to the stationary selection probabilities for each 
person or node. With a targeted walk design the random 
walk design is tweaked at each step, based on out-degree of 
each node, to obtain a design with specified limiting 
selection probabilities. 

Suppose that at some step in the sampling person i is the 
last person who has been added to the sample. Using a 
random walk procedure we randomly select one of the links 
out from that person, and that link leads to person j, who is
now our tentative selection. A screening interview reveals
that person j has more links out than person i, so that the
conditional probability of going from i to j as we just did
is larger than the conditional probability in the reverse 
direction, since the transition probabilities are related to the 
reciprocal of the number of links out. Therefore we calculate 
a probability less than one and accept person j into the 
sample only with that probability. If our tentative selection 
is not accepted we independently again choose a link out 
from person i. The probability of acceptance of the candi- 
date link is based on the Hastings (1970) generalization of 
the Metropolis algorithm. The acceptance probability 
depends on the desired target selection probabilities, the 
number of links out from the current node and the candidate 
node, and the probability of going in either direction with a 
random jump if that is part of the design (Thompson 2006a). 
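
The accept-reject step can be sketched for the simplest case. The code below assumes symmetric links, no random jumps, and no self-links, and treats a rejection as the walk remaining at node i for that step, as in the standard Metropolis-Hastings chain. Under those assumptions the acceptance probability reduces to min{1, (c_j/d_j)/(c_i/d_i)}, where d is the number of links out and c gives the relative target probabilities. The network and the values of c are hypothetical, and the general formula with jumps is the one in Thompson (2006a), not reproduced here.

```python
def acceptance_prob(c, degree, i, j):
    """Probability of accepting a tentative move from node i to node j."""
    return min(1.0, (c[j] / degree[j]) / (c[i] / degree[i]))

def transition_matrix(adj, c):
    """Exact one-step transition probabilities of the targeted walk."""
    degree = {v: len(nbrs) for v, nbrs in adj.items()}
    P = {i: {j: 0.0 for j in adj} for i in adj}
    for i, nbrs in adj.items():
        for j in nbrs:
            # Propose link i -> j with probability 1/d_i, then accept or reject.
            P[i][j] = acceptance_prob(c, degree, i, j) / degree[i]
        P[i][i] = 1.0 - sum(P[i][j] for j in nbrs)  # rejected: stay at i
    return P

# Hypothetical network; nodes "c" and "d" are targeted at twice the rate of "a" and "b".
adj = {"a": ["b", "c"], "b": ["a", "c", "d"], "c": ["a", "b"], "d": ["b"]}
c = {"a": 1.0, "b": 1.0, "c": 2.0, "d": 2.0}

P = transition_matrix(adj, c)
target = {v: c[v] / sum(c.values()) for v in adj}
after = {j: sum(target[i] * P[i][j] for i in adj) for j in adj}
print(target)
print(after)
```

The target distribution is left unchanged by one step of the chain, so it is stationary. With equal values of c the acceptance probability from a node with two links out to a node with three is 2/3, matching the verbal description above.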

Note that the method depends only on links out, which 
can usually be determined for sample members, whereas 
links in to sample individuals usually can not be determined. 
Therefore the method applies to directional as well as sym- 
metric networks. 

A uniform walk design is the special case in which the
targeted stationary selection probabilities are all equal. A 
targeted random walk design could be used for example to 
obtain a sample from a hidden population in which an 
individual with a certain high-risk behavior would have 
selection probability twice that of an individual without the 
behavioral characteristic. 

It is the sample of accepted people or nodes that has the 
desired stationary selection probabilities. If the tentative 
selections had been interviewed thoroughly also, not only 
the screening interview about out-degree, then in principle 
the estimates from the accepted sample could be improved 
using the Rao-Blackwell method (Casella and Robert 1996). 
That would involve calculation of the probabilities of 
getting the same data with different accept-reject results and 
in different orders of selection. With each of the different 
accept scenarios the estimate would be computed using the 
accepted set and each value weighted by the ordered 
selection and acceptance probabilities. In most cases there 
are too many combinations for exact calculation, and a more 
practical approach would be the Markov chain resampling 
method at the inference stage described in a later section of 
this paper. It is not clear that in practice it would be desir- 
able to compute the improved estimators using the data 
since full interviews rather than screening interviews would 
be required for those not initially accepted, the computations 
for the improvement are potentially demanding, and the 
calculation depends on knowing the selection probabilities 
for the initial sample, which is not needed for the simple 
estimators. 

With a targeted walk design in which the target stationary
selection probability $\pi_i$ of node $i$ is proportional to $c_i$, an
asymptotically consistent estimator, based on the limiting
probabilities, is provided by the generalized ratio estimator

$$\hat{\mu} = \frac{\sum_{i \in s} y_i / c_i}{\sum_{i \in s} 1 / c_i}$$

where $y_i$ is the value of the variable of interest for the $i$th
node and $s$ is the sample of selected nodes. In this type of
estimator only the relative values of the target probabilities
need be specified, since the proportionality constant cancels out.
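
A minimal Python sketch of the generalized ratio estimator follows. The function name `generalized_ratio_estimate` and the sample values are hypothetical and serve only to show that rescaling the relative target probabilities leaves the estimate unchanged.

```python
def generalized_ratio_estimate(y, c):
    """Generalized ratio estimate of the population mean.

    y: values of the variable of interest for the sampled nodes.
    c: relative target selection probabilities for the same nodes.
    """
    if len(y) != len(c):
        raise ValueError("y and c must have the same length")
    numerator = sum(y_i / c_i for y_i, c_i in zip(y, c))
    denominator = sum(1.0 / c_i for c_i in c)
    return numerator / denominator


# Hypothetical sample: nodes with y = 1 were targeted with twice the probability.
y = [1, 1, 0, 1, 0]
c = [2, 2, 1, 2, 1]
estimate = generalized_ratio_estimate(y, c)
# Multiplying every c_i by the same constant gives the same estimate.
rescaled = generalized_ratio_estimate(y, [10 * c_i for c_i in c])
```

Here the unweighted sample proportion would be 3/5, whereas the estimate is 3/7 because the over-sampled nodes are weighted down.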

Note that a straight Horvitz-Thompson or Hansen-Hurwitz
estimator cannot be used because the proportionality
constant in the inclusion probabilities is unknown,
whereas in the generalized ratio estimator it cancels out.
Again the limiting probabilities on which the estimator is
based hold exactly for the with-replacement design. For the
without-replacement variation, the properties of the targeted
strategies were fairly closely approximated by the
with-replacement properties in the empirical comparisons
(Thompson 2006a).


2.1.1 Designs using weighted links


Many studies of socially networked populations concep- 
tualize the network as having nodes (people) and lines or 
arrows representing the links or relationships between 
people. The network is characterized by an incidence matrix 
of 0s and 1s indicating when there is a link from node (row) 
i to node (column) j. In many real situations, however, 
more than one type of link may be of interest and links may 
have different weights representing differing strengths of a 
relationship. For example, in studies of risk behaviors and 
interventions in relation to the HIV epidemic, two types of
links of high interest are sexual relationships and drug 
injecting relationships. Other social relationships, such as 
friendships and living arrangements, may also be of interest 
to investigators and may be useful in finding members of 
the population. These types of relationships may have 
weights corresponding to frequency of encounters, geo- 
graphic proximity, or other measures of strength. 

In the basic form of weighted link designs we consider, 
in which one link from the most recently selected person is 
selected from the links out from that person, the selection is 
made with probability proportional to link weight. More 
generally, the selection could be made based on that weight 
but not necessarily proportional to it. However, we could 
then redefine the weight to be proportional to the probability 
we have under the design of following that weight, so that 
the following result would still apply. 

The following derivation shows that under suitable 
conditions the stationary selection probability for each 
person with such a design is proportional to the sum of the 
link weights out from that person. The result applies for a 
population in which it is possible to reach any one person 
from another following some path in which each link has 
weight greater than zero. That is, the population has a single 
component. 

For such a condition to hold it is advantageous to have at 
least some probability of following common but weak links. 
For example, a study of a sexually transmissible epidemic 
may want to focus with high probability on sexual links. But 
sexual links do not connect the population into a single 
component. Therefore, some smaller probability is allowed 
in the design for following friendship or geographic links, 
which represent weaker relationships between people and 
are of less inherent interest to investigators but serve to 
connect the population. Thus, the combination of different
types of links in this situation turns the population into a
single component for purposes of the design.


2.1.2 Stationary distribution of weighted link 
Markov chain design 


In this section we derive the stationary distribution of a
weighted link design in a single component situation. Keep
in mind that we may create the single component property
through innovative use of geographic links in combination
with social links.

Let $w_{ij}$ be the weight of a link between node $i$ and node
$j$, and assume that these links are symmetric, so that
$w_{ij} = w_{ji}$. Consider a random walk design, with replacement,
in which the transition probability to node $j$, given
the walk is at node $i$, is proportional to $w_{ij}$. That is, one
link is selected out from node $i$ with probability proportional
to weight. The transition probability is thus
$P_{ij} = w_{ij}/w_{i\cdot}$. The sum $w_{i\cdot} = \sum_j w_{ij}$ is the total weight out from
node $i$, generalizing the concept of degree with equally
weighted links.

Suppose the graph has only a single component, that is,
any node in the graph can be reached from any other node
by a path in which every link has positive weight. Then the
stationary probability for node $i$ is proportional to $w_{i\cdot}$.

Suppose that the probability that the walk is at node $i$ at
time $t$ is $\pi_i = w_{i\cdot}/w_{\cdot\cdot}$, for $i = 1, \ldots, N$, where
$w_{\cdot\cdot} = \sum_i \sum_j w_{ij}$, the total of all the weights. Then the probability
that the process is at node $i$ at time $t + 1$ is $\sum_j \pi_j P_{ji}$ by
the law of total probability. In terms of the link weights, this
sum is $\sum_j (w_{j\cdot}/w_{\cdot\cdot})(w_{ji}/w_{j\cdot}) = \sum_j w_{ji}/w_{\cdot\cdot}$. Because of the
symmetry of the weighted links, this becomes $w_{i\cdot}/w_{\cdot\cdot}$, so
that if node $i$ has this probability at time $t$ it has the same
probability at time $t + 1$, so that these are the stationary
probabilities of the process. By induction, once the process
reaches its stationary distribution it remains in it for every
step thereafter. In practice, especially with small sample
sizes or with different design variations, the stationary
distribution serves as an approximation to the exact
distribution.
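
The stationarity argument can be checked numerically. The sketch below uses exact rational arithmetic on a small, hypothetical matrix of symmetric integer link weights; the function names are illustrative only.

```python
from fractions import Fraction


def row_total_distribution(w):
    """pi_i = (total weight out of node i) / (total of all weights)."""
    row_totals = [sum(row) for row in w]
    grand_total = sum(row_totals)
    return [Fraction(r, grand_total) for r in row_totals]


def one_step(pi, w):
    """Distribution after one step of the weighted walk: sum over j of pi_j * P_ji."""
    n = len(w)
    row_totals = [sum(row) for row in w]
    return [sum(pi[j] * Fraction(w[j][i], row_totals[j]) for j in range(n))
            for i in range(n)]


# Hypothetical symmetric integer weights on a single-component graph.
w = [[0, 2, 1],
     [2, 0, 3],
     [1, 3, 0]]
pi = row_total_distribution(w)   # row totals 3, 5, 4 out of a grand total of 12
pi_next = one_step(pi, w)        # equals pi exactly when the weights are symmetric
```

Because the weights are symmetric, each column total equals the corresponding row total, which is why the one-step update returns the same distribution.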

If the weights are not symmetric, the selection probabili- 
ties of the random walk design will still approach a station- 
ary distribution provided there is only a single component 
or, if not, that the design incorporates random jumps. How- 
ever, with the directional weighted links, the stationary 
distribution is no longer of the simple form that can be 
calculated from sample data. 
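
To illustrate the asymmetric case, the following sketch approximates the stationary distribution of a small directed weighted graph by repeated one-step updates and compares it with the row-total proportions. The weights are hypothetical, and the iteration is assumed to converge because this particular chain has a single strongly connected, aperiodic component.

```python
def stationary_by_iteration(w, steps=500):
    """Approximate the stationary distribution by repeated one-step updates."""
    n = len(w)
    row_totals = [sum(row) for row in w]
    P = [[w[i][j] / row_totals[i] for j in range(n)] for i in range(n)]
    pi = [1.0 / n] * n
    for _ in range(steps):
        pi = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
    return pi


# Hypothetical directed weights: 0 -> 1, 1 -> 2, 2 -> 0 and 2 -> 1.
w = [[0, 1, 0],
     [0, 0, 1],
     [1, 1, 0]]
pi = stationary_by_iteration(w)
row_totals = [sum(row) for row in w]
# What the symmetric-case formula would give if it were wrongly applied here.
naive = [r / sum(row_totals) for r in row_totals]
```

For these weights the stationary probabilities are (0.2, 0.4, 0.4), whereas the out-weight proportions are (0.25, 0.25, 0.5). The stationary probability of a node here depends on the links into it, which is the information that usually cannot be observed for sample members.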


2.1.3 Different uses of weighted link designs 


Variations of weighted link designs could prove useful in 
situations of the following types. 

(1) Designs using general weights of links, on a con- 
tinuous or discrete scale, representing strength or im- 
portance of relationships and probability of following 
them. 

(2) Situations with two types of links, represented by
two weights, such as social networks with strong and
weak relationship links, or an HIV-at-risk study
focusing on both sexual contacts and drug-using
relationships.

(3) Survey settings in which links represent the
geographic or “random jump” part of the design, or the
seed design. For example, all people within a given
geographic stratum are linked by a geographic link,
or all the people who visit any of the venues on an
ethnographic map are thereby linked.

(4) In a situation where a sampling frame exists but the
frame covers only part of the population, all units 
within the frame can be considered to be connected 
by a “frame link”. Venue-based sampling typically 
forms one example of this type of situation. 

(5) Using a variation on the sampling design as a model 
for the way a virus or other infectious agent “sam- 
ples” people in a population. A type of weighted link 
design could be developed as a model for the spread 
of an infectious disease, finding the different importance
of different links. For influenza, the question would be the
relative importance of air-transported droplets (sneezing,
coughing) versus indirect contact through solid
objects (door knobs, money). For HIV, it would be the relative
importance of different types of sexual contacts and
unsafe injections, whether for illegal drugs or from
unsanitary medical injections, especially in third world
countries. The disease transmission in a simulation
has a slightly different protocol than the implemented 
designs, in that instead of thinking of one new link 
selected at each selection time step, there could be 
anywhere from zero to a high number of transmis- 
sions in a time step. 


2.1.4 Properties of weighted link designs and
associated population graphs


Suppose the relationships in the population are assigned
weights, with the weight $w_{ij}$ denoting the strength of the
relationships from node $i$ to node $j$. And suppose we use a
link tracing design of the walk type in which the transition
probability is

$$P_{ij} = \frac{w_{ij}}{w_{i\cdot}}$$

where $w_{i\cdot} = \sum_j w_{ij}$. This is the conditional probability of
selecting node $j$ as the next sample unit, given the most
recently selected unit is node $i$. The walk design is a
Markov chain on a graph, in which the graph has weighted
links.

We will next consider the question in the other direction
of when a Markov chain can be represented by a design of
this sort on a graph with weighted links. Given a Markov
chain specified by a matrix of transition probabilities $P_{ij}$,
we can always represent it as a walk design of this type on a
graph with weighted links so long as the links satisfy the
first of the following properties:

(1) $w_{ij} = P_{ij} w_{i\cdot}$, where the row weight totals are
arbitrarily chosen.

Next consider imposing some property on the weight
row totals to make them unique. For example:

(2a) If the $w_{i\cdot}$ weight row totals are chosen to be all equal
to a constant such as one, then the link weights
represent the conditional transition probabilities given
the process is at the node at which they originate.

(2b) If the $w_{i\cdot}$ weight row totals are proportional to the
stationary probabilities $\pi_i$ of the Markov chain for
each node $i$, or equal to them, then the weights
represent “flows” of the Markov chain, that is, the
unconditional probabilities of transitions along the
links.
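
The two choices of row totals can be illustrated with a short Python sketch. The three-state chain, its stationary probabilities, and the function name `weights_from_chain` are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def weights_from_chain(P, row_totals):
    """Link weights w_ij = P_ij * (row total for node i), as in property (1)."""
    n = len(P)
    return [[P[i][j] * row_totals[i] for j in range(n)] for i in range(n)]


# Hypothetical three-state chain and its stationary probabilities.
P = [[0.0, 1.0, 0.0],
     [0.0, 0.0, 1.0],
     [0.5, 0.5, 0.0]]
pi = [0.2, 0.4, 0.4]

# Property (2a): row totals all equal to one, so the weights are the
# conditional transition probabilities themselves.
w_conditional = weights_from_chain(P, [1.0, 1.0, 1.0])

# Property (2b): row totals equal to the stationary probabilities, so the
# weights are unconditional "flows" along the links.
w_flow = weights_from_chain(P, pi)

# For the flows, total weight into each node equals total weight out of it.
flow_out = [sum(row) for row in w_flow]
flow_in = [sum(w_flow[i][j] for i in range(3)) for j in range(3)]
```

Under choice (2b) the weights sum to one over all links, and the balance between flow in and flow out at every node is another way of stating that the row totals are stationary.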


In the practical situations for which we are trying to find
appropriate models and designs, the weights may be at least
partially given by the natural circumstances of the situation.
For example the weight $w_{ij}$ may represent the presence or
absence of a link from person $i$ to person $j$, or the number
of transactions of a certain type in a given time period from
$i$ to $j$. In that case, condition (2a) above would not in
general be satisfied and condition (2b) would be satisfied
only if all the weights were symmetric, that is, if $w_{ij} = w_{ji}$
for all $i$ and $j$.

In particular, if some or all of the weights are asymmetric,
with $w_{ij} \neq w_{ji}$, then (2a) would not usually be
satisfied and it would not be possible to arbitrarily choose
weights to impose the condition because typically the
stationary probabilities would not be known and could not be
calculated from the sample data. However, although the row
totals $w_{i\cdot}$ could not be arbitrarily imposed, they can be
known for units in the sample since they are simply the total
weight out from each unit.


2.2 Adaptive web sampling 


Targeted random walk designs provide considerable 
flexibility and control not offered by regular random walks. 
The use of weighted links with these designs extends that
flexibility further. This flexibility is still constrained,
however, by the restriction that the selection of the next link 
to follow can depend only on the most recently selected 
node in the sample. The incentive for developing the next 
set of designs was to remove this restriction and greatly 
expand the scope for flexibility and control in the available 
strategies. 

In an adaptive web sampling design (Thompson 2006b)
an initial sample of one or more units (nodes) is selected by
simple random sampling or other conventional design. From
then on, at each step in the sampling there is an active set
consisting of the sample selected so far or some subset of it.
In the simplest case, one link is selected from the links out
from this set. Sampling continues in this fashion until the
desired sample size has been reached or some other stopping
criterion has been satisfied. Some small probability is allowed,
however, that the next node is selected at random, or by some
other conventional design, from the entire population. The designs
can be done with or without replacement.

More generally a set of links can be selected at each step. 
Also the links at each step can be selected by a design more 
complicated than simple random sampling. The selection 
probabilities can be dependent on node or link characteris- 
tics and can be varying over time. 

The basic idea of an adaptive web sampling design is 
shown in the next set of figures. In Figure 8, an initial 
sample of two nodes has been selected by random sampling 
without replacement. At the next step a link may be chosen 
out at random from either of the initial nodes to add a new 
node to the sample, as shown in Figure 9. The next node is 
selected by following one of the links out from the current 
sample. With a random walk a link would need to be 
followed from the last node selected, but with adaptive web 
sampling any eligible link out from the current sample 
(active set) may be followed. Note the next selection, shown 
in Figure 10, is not via a link from the most recently selected 
node, but from a previous one. As sampling progresses it is 
free to branch out flexibly in different directions as well as 
select new nodes at random from the population (Figure 11). 
The design can be stopped at a specified sample size or
when some other criterion is met. In the design shown in the figures, links
out from the current sample were not selected completely at 
random but with higher probability given to following links 
from high-risk individuals, represented by dark or red 
nodes. Further, the design shown allowed a 0.1 probability 
of selecting the new node at random at any step instead of 
following a link. 
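
A simplified version of such a design can be sketched in Python. This is not the design used for the figures: the sketch assumes sampling without replacement, takes the active set to be the whole current sample, and follows eligible links with equal probability rather than favoring high-risk nodes. The function name, arguments, and the small example network are hypothetical.

```python
import random


def adaptive_web_sample(out_links, n, n0=2, jump_prob=0.1, rng=None):
    """Draw an adaptive web sample of n nodes without replacement.

    out_links maps each node to the list of nodes it links to. The active
    set is the whole current sample, and links out of it are followed with
    equal probability (both are simplifying assumptions).
    """
    rng = rng or random.Random()
    nodes = list(out_links)
    sample = rng.sample(nodes, n0)            # initial simple random sample
    while len(sample) < n:
        in_sample = set(sample)
        # Links from the active set to nodes not yet in the sample.
        eligible = [(i, j) for i in sample for j in out_links[i]
                    if j not in in_sample]
        if eligible and rng.random() >= jump_prob:
            _, new_node = rng.choice(eligible)    # follow a link
        else:
            unsampled = [v for v in nodes if v not in in_sample]
            new_node = rng.choice(unsampled)      # random jump
        sample.append(new_node)
    return sample


# Hypothetical network with two components and two isolated-looking nodes.
network = {0: [1, 2], 1: [0], 2: [0, 3], 3: [2],
           4: [5], 5: [4], 6: [], 7: [6]}
sample = adaptive_web_sample(network, n=6, n0=2, jump_prob=0.1,
                             rng=random.Random(1))
```

Because any link out of the current sample is eligible, the new node need not be adjacent to the most recently selected one, which is the feature that distinguishes this design from a random walk.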


Figure 8 The first two nodes selected at random

Figure 9 The next node is selected by following one of the links
out from the current sample

Figure 10 Note the next selection is not via a link from the
last-selected node, but from a previous one

Figure 11 As sampling progresses it is free to branch out flexibly
in different directions as well as select new nodes at
random from the population


2.2.1 Inference methods 


Design-unbiased and design-consistent estimation methods
for use with adaptive web sampling designs are
described in Thompson (2006b). Bayes model-based
estimation methods for use with adaptive web sampling are
described in Kwanisai (2005).

The design-based estimators are constructed by starting
with some relatively easy to compute estimator that depends
on the order of selection of the sample. This initial estimator
is then improved using the Rao-Blackwell method, that is, by
obtaining the expected value of the initial estimator
conditional on the minimal sufficient statistic.


2.3 Estimator based on initial sample mean 


Suppose $\hat{\mu}_0$ is an unbiased estimator of the population
mean that depends on the order in which the sample is
selected. If the initial sample of nodes has been selected by
simple random sampling, one example of an unbiased initial
estimator that depends on order is the initial sample mean.
The improved estimator has the form

$$\hat{\mu} = E(\hat{\mu}_0 \mid d_r) = \sum_{\{\mathbf{s} : r(\mathbf{s}) = s\}} \hat{\mu}_0(\mathbf{s}) \, p(\mathbf{s} \mid d_r).$$

Here $\mathbf{s}$ denotes the sample in order of selection, $r$ is the
reduction function that reduces the ordered sample to $s$, the
unordered sample of the minimal sufficient statistic. The
reduced data $d_r$ consists of the unordered sample together
with the associated values of the variables of interest. The
improved estimator $\hat{\mu}$ is the expected value of the initial
estimator over all $n!$ reorderings of the sample data. In
calculating the expectation, each of the reorderings is
weighted by the selection probability $p(\mathbf{s} \mid d_r)$.
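
For small samples the expectation over reorderings can be computed by direct enumeration, as in the Python sketch below. The ordering probability is supplied by the caller as a function `order_prob`; in an actual application it would come from the sampling design, whereas the one used here is a hypothetical stand-in in which orderings that start with unit "b" are twice as likely. Only the relative probabilities matter, since the sketch normalizes them to obtain the conditional weights.

```python
from itertools import permutations


def initial_mean(ordering, y, n0):
    """Initial estimator: mean of y over the first n0 units of the ordering."""
    return sum(y[u] for u in ordering[:n0]) / n0


def improved_estimate(sample, y, n0, order_prob):
    """Expected value of the initial estimator over all reorderings.

    order_prob(ordering) returns the probability, up to a constant, of
    selecting the sample in that order under the design.
    """
    orderings = list(permutations(sample))
    probs = [order_prob(s) for s in orderings]
    total = sum(probs)            # normalizes to the conditional probabilities
    return sum(initial_mean(s, y, n0) * p
               for s, p in zip(orderings, probs)) / total


# Hypothetical sample values and ordering probabilities.
y = {"a": 1.0, "b": 4.0, "c": 10.0}


def order_prob(ordering):
    return 2.0 if ordering[0] == "b" else 1.0


estimate = improved_estimate(["a", "b", "c"], y, 1, order_prob)
uniform = improved_estimate(["a", "b", "c"], y, 1, lambda s: 1.0)
```

With equally likely orderings the improved estimate is simply the sample mean, 5.0; with the unequal ordering probabilities above it is 4.75, showing how the design enters through the weights.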

Other initial estimators used with adaptive web sampling 
utilize the entire sample data but depend on order and are 
based on using the conditional probabilities of selecting 
each new unit in sequence given the previously selected 
units. Four types of design-based estimators for use with 
adaptive web sampling are given in Thompson (2006b). 

Computation of the improved estimator $\hat{\mu}$ and its
variance estimators under various adaptive web designs 
involves enumerating the reorderings of the sample 
selection sequence. For each reordering, the probability of 
that ordering under the design is computed, along with the 
values of the estimators and variance estimators. Direct 
calculation is fast and efficient up to sample sizes of ten or 
so, which involve no more than a few million permutations 
to be enumerated. For larger sample sizes, the numbers of 
permutations or combinations of potential selection se- 
quences in the conditional sample space become prohibi- 
tively large for the exact, enumerative calculation. For this 
reason, a Markov chain resampling approach was used in 
Thompson (2006b) for computing the improved estimators. 
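
The growth in the number of reorderings is easy to tabulate; the sample sizes below are chosen only for illustration.

```python
import math

# Number of reorderings n! that exact enumeration must visit.
reorderings = {n: math.factorial(n) for n in (5, 10, 12, 15)}
# {5: 120, 10: 3628800, 12: 479001600, 15: 1307674368000}
```

A sample of ten units already has about 3.6 million reorderings, and a sample of fifteen has more than a trillion.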

The resampling procedure is as follows. The object is to
obtain a Markov chain $x_0, x_1, x_2, \ldots$ having stationary
distribution $p(x \mid d_r)$. Here $x_k$ denotes an entire reordering
of the sample at step $k$ of the chain. Suppose that at step
$k - 1$ the value is $x_{k-1} = j$, so that $j$ denotes the current
permutation of the sample data in the chain. A tentative or
candidate permutation $c_k$ is produced by applying the
original sampling design, with sample size $n$, to the data as
if the sample comprised the whole population, that is, as if
$N = n$. This resampling distribution, denoted $p_c$, differs
from, but has some similarity to, the actual sampling design
$p$. The desired conditional distribution $p(x \mid d_r)$ is
proportional to the unconditional distribution $p(x)$ under
the original design applied to the whole population.
Let

$$\alpha = \min\left\{ \frac{p(c_k)}{p(x_{k-1})} \, \frac{p_c(x_{k-1})}{p_c(c_k)}, \; 1 \right\}.$$

With probability $\alpha$, $c_k$ is accepted and $x_k = c_k$, while with
probability $1 - \alpha$, $c_k$ is rejected and $x_k = x_{k-1}$.

This procedure produces a Markov chain $x_0, x_1, x_2, \ldots$
having the desired stationary distribution $p(x \mid d_r)$. The
chain is started with the original sample $\mathbf{s}$ in the order
actually selected. Given any value of the minimal sufficient
statistic $d_r$, the chain is thus started in its stationary
distribution and so remains in its stationary distribution step
by step.

Suppose that $n_r$ resampled permutations are selected by
this process and let $\hat{\mu}_{0h}$ denote the value of the initial
estimator for the $h$th permutation. An enumerative estimator of
the form $\hat{\mu} = E(\hat{\mu}_0 \mid d_r)$ is replaced by the resampling
estimator

$$\tilde{\mu} = \frac{1}{n_r} \sum_{h=1}^{n_r} \hat{\mu}_{0h}.$$
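
The resampling procedure can be sketched generically in Python as an accept-reject chain over orderings whose values of the initial estimator are averaged. The sketch is not the implementation of Thompson (2006b): the functions `propose`, `p` and `p_c` are placeholders that the user would derive from the actual design, and the toy versions below (uniform proposals, and a design that only permits orderings starting with unit "a") are hypothetical.

```python
import random


def resampling_estimate(start, propose, p, p_c, initial_estimator, n_r, rng=None):
    """Average of the initial estimator along an accept-reject chain of orderings.

    start: the sample in the order actually selected (must have p(start) > 0).
    propose(rng): draws a candidate ordering from the resampling distribution.
    p(x), p_c(x): probability of ordering x under the design and under the
    resampling distribution; each need only be known up to a constant.
    """
    rng = rng or random.Random()
    current = tuple(start)
    values = []
    for _ in range(n_r):
        candidate = tuple(propose(rng))
        # Acceptance probability min{[p(c) / p(x)] * [p_c(x) / p_c(c)], 1}.
        ratio = (p(candidate) * p_c(current)) / (p(current) * p_c(candidate))
        if rng.random() < min(1.0, ratio):
            current = candidate
        values.append(initial_estimator(current))
    return sum(values) / n_r


# Hypothetical toy setting with three sampled units.
y = {"a": 1.0, "b": 4.0, "c": 10.0}
units = ["a", "b", "c"]


def shuffle_units(rng):
    return rng.sample(units, len(units))     # uniform over orderings


def first_value(ordering):
    return y[ordering[0]]                    # initial sample mean with one initial unit


# Design under which only orderings starting with "a" have positive probability.
restricted = resampling_estimate(units, shuffle_units,
                                 p=lambda x: 1.0 if x[0] == "a" else 0.0,
                                 p_c=lambda x: 1.0,
                                 initial_estimator=first_value,
                                 n_r=200, rng=random.Random(0))
```

In the restricted toy design every candidate that does not start with "a" has acceptance probability zero, so the chain never leaves the support of the design and the estimate equals the value for unit "a".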


Bayes model-based inference with adaptive web sam- 
pling designs also requires the use of Markov chain Monte 
Carlo (MCMC) methods except in certain fairly simple 
design situations (Chow and Thompson 2003) where explic- 
it Bayes posterior distribution, estimators, and intervals can 
be obtained. More generally the MCMC sequence involves 
at each step updating of model parameter estimates and, in a 
data augmentation procedure, obtaining a complete realiza- 
tion of the population network and its values from the 
predictive posterior distribution conditional on the observed 
data (Kwanisai 2005, 2006). The resulting Markov chain 
sequence of complete population realizations provides the 
flexibility to make inference about many types of population 
characteristics. 


2.4 Modification of adaptive web sampling 
procedures 


Adaptive web sampling designs are a generalization of
random walk designs. The more general designs do not have
the exact stationary distribution properties of walk designs,
since more than one link may be followed from any node,
links may be followed from sample nodes other than the
most recently selected one, and the sampling may be done
without replacement. However, the stationary distribution
properties of a random walk or other Markov chain design
may serve as a guide to approximate properties one might
expect from a similar adaptive web sampling design.

During the sampling, at the time of the $t$th unit selection
in the $k$th wave, let $w_{a_k+}$ be the total number of links out,
or the total of the weight values, from the active set $a_k$ to
units not in the current sample $s_{ckt}$. That is,
$w_{a_k+} = \sum_{\{i \in a_k,\, j \notin s_{ckt}\}} w_{ij}$. When $w$ is an indicator variable, $w_{a_k+}$ is
the total of the net out-degrees of the individual units in the
active set $a_k$, where net out-degree is the out-degree of a
unit minus the number of its links to other units already in
the current sample.

For each unit $i$ in the sample, the variable of interest $y_i$
and the out-degree (or out-weight) $w_{i+}$ are recorded. In
addition, for each pair of units $(i, j)$ for which both $i$ and
$j$ are in the sample, the values of the link variables $w_{ij}$ and
$w_{ji}$ are observed.

Consider as a candidate for the $t$th selection in the $k$th
wave a unit $i$ not in the current sample, so $i \notin s_{ckt}$.
Suppose the current active set $a_k$ contains one or more units
having links or positive weights out to unit $i$, and let
$w_{a_k i} = \sum_{\{j \in a_k\}} w_{ji}$ denote their total. The probability that unit
$i$ is the next unit selected is

$$q_{kti} = b \, \frac{w_{a_k i}}{w_{a_k+}} + (1 - b) \, \frac{1}{N - n_{ckt}}$$

where $b$ is between 0 and 1. If there are no links at all out
from the current active set, then

$$q_{kti} = \frac{1}{N - n_{ckt}}.$$

Thus, with probability b link-tracing is done, and one of 
the links out from the current active set is selected at 
random, or with probability proportional to its weight, and 
the node to which it leads is added to the sample, while with 
probability 1—b the new sample unit is selected com- 
pletely at random from the units not already selected. How- 
ever, if there are no links or positive weights out from the 
active set to any unsampled units, then the next unit is 
selected from the collection of unsampled units. 
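
The selection probabilities for the next unit can be computed directly from the link weights, as in the following Python sketch. The function name, the argument layout (a square matrix `w` with `w[i][j]` the weight from unit i to unit j), and the five-unit example are hypothetical.

```python
def selection_probabilities(w, active, sample, b):
    """Probability that each unsampled unit is the next unit selected.

    w[i][j] is the weight of the link from unit i to unit j (0 if absent),
    active is the active set (a subset of the current sample), sample is
    the current sample, and b is the probability of link tracing.
    """
    N = len(w)
    unsampled = [i for i in range(N) if i not in sample]
    # Total weight from the active set to each unsampled unit.
    w_to = {i: sum(w[j][i] for j in active) for i in unsampled}
    w_total = sum(w_to.values())       # total weight out of the active set
    uniform = 1.0 / len(unsampled)     # 1 / (N - current sample size)
    if w_total == 0:
        # No links out of the active set: select at random.
        return {i: uniform for i in unsampled}
    return {i: b * w_to[i] / w_total + (1.0 - b) * uniform for i in unsampled}


# Hypothetical population of five units with links 0 -> 2, 1 -> 2 and 1 -> 3.
w = [[0, 0, 1, 0, 0],
     [0, 0, 1, 1, 0],
     [0, 0, 0, 0, 0],
     [0, 0, 0, 0, 0],
     [0, 0, 0, 0, 0]]
q = selection_probabilities(w, active=[0, 1], sample=[0, 1], b=0.9)
```

Unit 2 is reached by two of the three links out of the active set, so it receives most of the probability, while unit 4, which has no link from the active set, can only be selected through the random component.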

Basic adaptive web sampling can be generalized to use
weighted links. If the relationship variable $w$ consists of
weights, instead of having just 0 or 1 values, then the
link-based selection can depend on these weights. For example,
link weights can be defined in relation to the $y$ value of an
originating node or as a distance measure to the connected
node, so that links are followed with higher probability from
nodes with higher values or with lower probability to distant
nodes. Then a link from the active set can be selected with
probability proportional to link weight, or with some other
selection probability $p(i \mid s_{ckt}, a_k, y_{a_k}, w_{a_k})$ depending on
variables of interest only through the active set. For
example, a link out could be selected at random from the
links with $w_{ij}$ greater than some constant, or $y_i$ greater than
some constant. The selection probability when links are not
followed does not have to be uniform over the units not in
the current sample, but can be a more general design
$p(i \mid s_{ckt})$ such as selecting with probability related to an
auxiliary variable or from a spatially defined distribution.

With weighted links $w$ represents a possibly continuous
link weight variable and the probability that unit $i$ is the
next unit selected is

$$q_{kti} = b \, p(i \mid s_{ckt}, a_k, y_{a_k}, w_{a_k}) + (1 - b) \, p(i \mid s_{ckt}).$$

If there are no links or positive weights from $a_k$ to $i$, then
$q_{kti} = p(i \mid s_{ckt})$.


Once unit i has been selected, it is possible to add an 
accept/reject step for deciding whether to include it in the 
active set, for example, accepting with higher probability if 
unit i has a high value or high degree. 

In the design the constant $b$ itself can also be replaced
by a probability $b(k, t, a_k, y_{a_k}, w_{a_k})$ depending on values
related to nodes and links in the active set or changing as
sample selection progresses. For example, if the values of
the units in $a_k$ are particularly high, we could increase the
probability of following links. As for dependence of $b$ on
$(k, t)$, the use of an initial conventional sample of size
$n > 1$ may be viewed as serving to obtain some information
from basic coverage of the population before adaptive
sampling is allowed to commence.


3. Spatial adaptive web sampling 


Adaptive sampling designs such as adaptive cluster 
sampling (Thompson 1990) were developed in response to 
the need for more effective strategies for sampling spatially 
uneven populations, particularly those having a rare, clus- 
tered geographic distribution. Most populations having a 
network structure also have an inherent geographic or 
spatial structure. For example, human populations have 
social network structure but are also distributed in space. Of 
particular interest from the sampling design point of view, 
spatial structures can be characterized with graph or network 
structures. For example, neighborhood relationships based 
on geographic proximity can be recast in the form of lattice- 
type graphs. In this way, network designs such as those 
described in the previous section can be applied to solve 
spatial sampling problems. 

In this section the use of adaptive web sampling designs
to sample a spatially uneven population will be described.
These designs could be viewed as a generalization of
adaptive cluster sampling. In this view, adaptive cluster
sampling would be a special case in which every link is
followed until there are no more links out from the current
sample. The adaptive web sampling class of designs offers
more flexibility and control, however, and is potentially
more efficient to use for many spatial populations.

With adaptive cluster sampling the constraint to continue 
to sample until all neighbors of all units satisfying the 
condition were included meant that overall sample size was 
not controlled in advance and was rather stringent when 
some networks were unusually large. Adaptive web sam- 
pling in the spatial context solves this problem since sample 
size can be fixed in advance. In terms of its network 
recasting, the simple unbiased estimators of adaptive cluster 
sampling use data only from the strongly connected compo- 
nents that the initial sample intersects. Rao-Blackwell 
improvements based on those estimators can use in addition 
data from the weakly connected extensions of those compo- 
nents. The familiar edge units of spatial adaptive cluster 
sampling are a special case of such weakly connected 
extensions of strongly connected components. 

Figure 12 depicts a study region with a spatially clustered
population as may be encountered in ecological, epi- 
demiological, and social demographic surveys. In one form 
of adaptive spatial designs the neighborhood of a unit is 
defined as the set of immediately adjacent units, and neigh- 
boring units are added to the sample when the value of a 
sample unit is high or meets some other criterion. In Figure 
13 the spatial population has been recast as a directed graph. 
The square spatial units are redrawn as nodes in a graph, 
and whenever the number of objects in a unit exceeds zero, 
arrows representing graph links are drawn from that node to 
neighboring nodes. Nodes representing units with nonzero 
values are colored dark (red). Figure 14 shows a random 
sample of nodes to be used as the initial sample of an 
adaptive web design. The adaptive web sampling continues 
until the targeted final sample size of 20 units is obtained in 
Figure 15. The sample is recast in the spatial setting in 
Figure 16. Unlike adaptive cluster sampling, it was not 
necessary to continue sampling until every unit in a sampled 
connected component is included. Further, the small 
probability of a random jump keeps the design from being 
stuck in any connected component. 
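
The recasting of a spatial population as a directed graph can be sketched in Python as follows. The sketch assumes that the neighborhood of a unit consists of the four units sharing an edge with it, which is only one possible definition; the function name and the small grid of counts are hypothetical.

```python
def grid_to_graph(counts):
    """Recast a grid of counts as a directed graph.

    Nodes are (row, column) cells. An arrow is drawn from a cell to each of
    its immediately adjacent cells whenever the cell's count exceeds zero.
    Adjacency is taken to be the four cells sharing an edge (an assumption).
    """
    n_rows, n_cols = len(counts), len(counts[0])
    out_links = {}
    for r in range(n_rows):
        for c in range(n_cols):
            neighbors = []
            if counts[r][c] > 0:
                for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                    rr, cc = r + dr, c + dc
                    if 0 <= rr < n_rows and 0 <= cc < n_cols:
                        neighbors.append((rr, cc))
            out_links[(r, c)] = neighbors
    return out_links


# Hypothetical 3 x 3 study region with objects in two adjacent units.
counts = [[0, 0, 0],
          [0, 3, 1],
          [0, 0, 0]]
graph = grid_to_graph(counts)
```

Empty units have no links out, so a link-tracing design can enter them from an occupied neighbor but cannot continue from them except by a random jump. The units reached this way play the role of the edge units of adaptive cluster sampling.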

A glimpse of the immense flexibility offered with the
adaptive web sampling designs in the spatial setting is
shown in Figure 17. In the top row a spatial population is
recast as a graph, though the directions of the links are not
shown. The bottom row shows samples from two variations
of adaptive web sampling. On the left, sixteen initial units
have been selected independently at random. From each, an
adaptive web sampling procedure is carried out to a sample
size of five units. With this design, the sample is spread
throughout the study region while also reaching into
components. In the design on the right a single initial unit is
selected at random and adaptive web sampling continues to
a total of 80 units. The 0.1 probability of selecting the next
unit at random at any step prevents the design from being
stuck in any one component. With this design the main
components or aggregations get very thorough, though not
exhaustive, coverage.

Figure 12 A spatially clustered population

Figure 13 A network representation of relevant neighborhood
relationships in the spatial population

Figure 14 An initial random sample of spatial units

Figure 15 Adaptive web sample of 20 units starting from the
initial sample of the previous figure

Figure 16 Spatial representation of the adaptive web sample


3.1 Spatial designs with weighted links


For selecting spatial samples, link weights can be defined 
as a function of the distance between sites. For example, for 
increased sample the function would give larger weight to 
sites at close distance. On the other hand, for space filling 
purposes sites at larger distance could have larger weight. A 
network sampling design in such a setting, with link weights 
defined solely on the basis of distance, would not in general 
be adaptive. That is because the spatial frame would enable 
a link-tracing design to select the entire sample of sites 
before going in the field to make any observations. 

More generally, though, link weights can be defined as a
function of both distance and observed values. For a unit in
the sample having a high observed value of the variable of
interest, the function could give higher weight at distances
close to that site and smaller weight to distant sites. For a
unit having a low value of the variable of interest the weight
function could have a more uniform shape.
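
One way such a weight function might look is sketched below in Python. The exponential decay, the `threshold` and `scale` parameters, and the function names are all hypothetical choices made for illustration; the text does not prescribe a particular functional form.

```python
import math


def link_weight(distance, value, threshold=0.0, scale=1.0):
    """Hypothetical link weight from a sampled site to a site at a given distance.

    Sites with a high observed value (value > threshold) favor nearby sites
    through an exponential decay; low-value sites weight all sites equally.
    """
    if value > threshold:
        return math.exp(-distance / scale)
    return 1.0


def distance_selection_probabilities(distances, value, threshold=0.0, scale=1.0):
    """Selection probabilities proportional to the link weights."""
    weights = [link_weight(d, value, threshold, scale) for d in distances]
    total = sum(weights)
    return [wt / total for wt in weights]


# Candidate sites at distances 1, 2 and 3 from the most recently observed site.
high = distance_selection_probabilities([1.0, 2.0, 3.0], value=5.0)
low = distance_selection_probabilities([1.0, 2.0, 3.0], value=0.0)
```

After a high observed value the nearest candidate is the most likely next selection, while after a low value the three candidates are equally likely, so the design searches locally only where the observations suggest a cluster.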

Random walk designs in particular are straightforward to
carry out in spatial settings with link weights dependent
only on distance. That is because at any point in the sampling
the selection of the next site depends only on the most
recently selected site, so that only one weight function needs
to be considered. With more general designs such as 
adaptive web sampling the use of link weight functions 
dependent on both distance and value opens up very wide 
flexibility in the possibilities available for adaptive strategies. 
Figure 17 Adaptive web sampling design variations

4. Discussion


Adaptive sampling designs expand considerably the 
possibilities for sampling strategies. They appear to be espe- 
cially useful for populations which are otherwise difficult to 
sample. Network sampling designs are inherently adaptive 
in most cases and can provide more effective ways to sam- 
ple populations with network or spatial structure. In this 
paper the emphasis has been on designs obtaining low mean 
square error or providing practical means of reaching a 
hidden population. In other cases the primary objective 
might be simply to obtain a higher yield sample, that is, a 
sample having a high total value of the variable of interest. 
For instance, environmental hot spots are where remediation
must be carried out, and high-risk components of an
epidemic-related network are where treatment or intervention
might have the greatest effect. The advantages of an adaptive approach are
even more straightforward when the objective is high 
sample yield. 

Fully optimal sampling strategies are in most cases not
practical to implement because of computational complexity
and model dependency. A more practical approach is to
improve on conventional designs with simple adaptive
procedures that capture much of the essence of the optimal
strategies; the choice of design often has much more effect
than the choice of one inference method over another.

Simulation analyses with adaptive strategies of different 
types have tended to lend support to the idea that it is good 
to have a strong underlying conventional component. Many 
of the practical strategies have the form of an initial conven- 
tional sample with adaptive sampling extending the sample 
from there through either network or spatial relationships 
and depending on observed values. Strategies with that type 
of balance between conventional and adaptive components 
have in simulations generally performed better than, say, 
selecting a single unit conventionally and adaptively adding 
the whole rest of the sample from there. In the simulations 
most efficient strategies tended to have an initial sampling 
making up about 60-80 percent of the total sample size. The 
modest amount of adaptive sampling after that then produced
large gains in efficiency. This empirical experience
goes along with the characteristic of optimal adaptive 
strategies, in which there seems to be a push and pull 
between spreading units far apart or filling in unobserved 
parts of the study region, corresponding to the conventional 
component of the simplified designs, and placing new units 
in the most promising areas, corresponding to the adaptive 
component in the simplified designs. 
Alternative survey sample designs: Sampling with multiple overlapping frames


Sharon L. Lohr ¹


Abstract 


Designs and estimators for the single frame surveys currently used by U.S. government agencies were developed in 
response to practical problems. Federal household surveys now face challenges of decreasing response rates and frame 
coverage, higher data collection costs, and increasing demand for small area statistics. Multiple frame surveys, in which 
independent samples are drawn from separate frames, can be used to help meet some of these challenges. Examples include 
combining a list frame with an area frame or using two frames to sample landline telephone households and cellular 
telephone households. We review point estimators and weight adjustments that can be used to analyze multiple frame 
surveys with standard survey software, and summarize construction of replicate weights for variance estimation. Because of 
their increased complexity, multiple frame surveys face some challenges not found in single frame surveys. We investigate 
misclassification bias in multiple frame surveys, and propose a method for correcting for this bias when misclassification 
probabilities are known. Finally, we discuss research that is needed on nonsampling errors with multiple frame surveys. 


Key Words: Bias correction; Dual frame survey; Misclassification; Mode effects; Sampling for rare events; Sampling
weights; Small area estimation.


1. Uses of multiple frame surveys 


In classical design-based sampling theory, a probability
sample is taken from the (single) sampling frame, and the
inclusion probabilities in the sampling design can be used to
make inferences about the population. Let $y_i$ be a
measurement on unit $i$ in the population of $N$ units, let $S$
denote the set of units in the sample, and let $\pi_i = P$(unit $i$
is included in the sample). Then the Horvitz-Thompson
(1952) estimator of the population total $Y = \sum_{i=1}^{N} y_i$ is
$\hat{Y} = \sum_{i \in S} w_i y_i$, where $w_i = 1/\pi_i$ is the sampling weight.
If the sampling frame includes everyone in the target population,
all sampled units respond, and there is no measurement
error, then the Horvitz-Thompson estimator is unbiased
for $Y$.
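
As a minimal illustration of the Horvitz-Thompson estimator described above, the following Python sketch computes the weighted total for a small sample. The function name `horvitz_thompson_total` and the toy data are assumptions made purely for illustration; they do not come from any survey discussed in this paper.

```python
def horvitz_thompson_total(y, pi):
    """Horvitz-Thompson estimate of a population total.

    y  : observed values for the sampled units
    pi : inclusion probabilities of the same units, each in (0, 1]
    """
    if len(y) != len(pi):
        raise ValueError("y and pi must have the same length")
    if any(p <= 0 or p > 1 for p in pi):
        raise ValueError("inclusion probabilities must lie in (0, 1]")
    # Each sampled unit is weighted by w_i = 1 / pi_i.
    return sum(yi / p for yi, p in zip(y, pi))


# Hypothetical sample of three units with unequal inclusion probabilities.
y_sample = [10.0, 4.0, 6.0]
pi_sample = [0.5, 0.25, 0.5]
ht_total = horvitz_thompson_total(y_sample, pi_sample)
```

A unit sampled with probability 0.25 represents four population units, so its value is multiplied by 4 in the estimated total.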

The practical challenges of sampling in the 1940s and 
1950s drove the methodological developments of stratified 
multistage surveys and estimators such as the Horvitz- 
Thompson estimator. In-person surveys relied on unequal 
probability sampling to balance interviewer workloads and 
reduce variances. Response rates were high in many gov- 
ernment surveys so that the assumptions for the Horvitz- 
Thompson estimator were reasonable. We now face new 
challenges in household surveys. Nonresponse rates are 
increasing, which means that survey estimates rely more on 
models. The ethnic and language diversity of a population 
can result in undercoverage and measurement error. In- 
creasing technological diversity means that different resi- 
dents may be best reached by different sampling modes; one 
must then be confident that different sampling modes 
measure the same quantities. Costs of collecting data have
risen greatly, in part due to increasing nonresponse; at the
same time, governmental and research demands for data
have also risen greatly.

Multiple frame surveys can achieve better population 
coverage at lower cost. They can be used as part of a struc- 
ture of modular survey design that relies on different sam- 
pling frames to help reduce costs and achieve better cov- 
erage. They can also use administrative data efficiently. In 
this paper, we describe different types of multiple frame 
surveys and discuss some of the research that is completed 
and research that may be needed for their use. 

One of the earliest multiple frame surveys (aside from 
early capture-recapture methods) was performed by the 
Census Bureau in 1949 (Hansen, Hurwitz and Madow 
1953). In the Sample Survey of Retail Stores, a probability 
sample of primary sampling units (psus) was chosen. Within 
each psu, a list of large retail firms was constructed from 
records of the Old Age and Survivors Insurance Bureau. All 
firms on the list were sampled, and an area sample of firms 
in the psu that were not on the list was taken. In this case, a 
screening dual frame design was employed within each 
selected psu; units in the list frame were screened out of the 
area frame before sampling. Thus, the estimator of total 
sales summed the two estimators within each psu. No new 
statistical methods were required to estimate total sales in 
this survey, since essentially a stratified sample was taken in 
each psu: the firms on the list in the psu formed one stratum, 
and the firms in the area frame but not on the list in the psu 
formed the second stratum. The survey resulted in cost 
savings because it was relatively inexpensive to sample 
from the firms on the list, yet full coverage was obtained by 
also using the area frame. 
Many agricultural surveys also have used a screening
dual frame survey design (Gonzalez-Villalobos and Wallace 
1996). In such a design, farms belonging to the list frame 
are removed from the area frame before sampling com- 
mences. Considerable cost savings can be realized since 
often the list frame is much less expensive to sample and it 
contains the largest farms. 

In many cases, however, it may not be possible or prac- 
tical to remove list-frame units from the area frame before 
sampling. Instead, in an overlapping dual frame survey, 
independent probability samples are taken from frame A 
(the area frame) and frame B (the list frame); this is depicted 
in Figure 1. Rare populations can often be sampled more 
efficiently using a multiple frame sample (Kalton and 
Anderson 1986). In an epidemiology study, for example, 
frame A might be that used for a general population health 
survey, while frame B might be a list frame of clinics spe- 
cializing in a certain disease. The sample from frame B is 
expected to yield a high percentage of persons with the 
disease of interest, so that sampling will be efficient; the 
sample from frame A, though more expensive, leads to 
complete coverage of the population. 

In other situations, all frames are incomplete, as con- 
sidered by Hartley (1962); for example, frame A in Figure 2 
might be a frame of landline telephones and frame B might 
consist of cellular telephone numbers. There are three do- 
mains: domain a consists of units in frame A but not in 
frame B, domain b consists of units in frame B but not in 
frame A, and domain ab consists of units in both frames. In 
the telephone context, domain a contains individuals belonging
to a landline-only household, domain b consists of
individuals who have only a cellular telephone, and domain
ab consists of individuals who have both cellular and landline
telephones. It is unknown in advance whether a household
member sampled using one frame also belongs to the
other frame (Brick, Dipko, Presser, Tucker and Yuan 2006); 
typically, respondents are asked about their cellular and 
landline telephone usage to determine domain membership. 




Figure 1 A dual frame design in which frame B is a subset of 
frame A 
More than two frames can be employed as well, as illustrated
in Figure 3 for a three-frame survey in which all
frames are incomplete. In this situation, there are seven 
domains. Iachan and Dennis (1993) gave an example of a 
three-frame survey used to sample the homeless population, 
where frame A is a list of soup kitchens, frame B is a list of 
shelters, and frame C consists of street locations. Figure 4 
displays a 3-frame survey in which frame A has complete 
coverage, while overlapping frames B and C are both 
incomplete but are less expensive to sample. This design has 
been used for the U.S. Scientists and Engineers Statistical 
Data System (SESTAT; National Science Foundation 2003)
surveys. The same design might be used when A is the 
frame for a general population survey, B is a landline 
telephone survey, and C is a cell phone survey. 


Figure 2 Frames A and B overlap, creating the three domains 
a, b, and ab


Figure 3 Frames A, B, and C are all incomplete and overlap 


There is much potential for using multiple frame designs 
in household surveys, including: 


1. Use of multiple list frames from administrative 
records. 

2. Multiple mode sampling (for example, using inde- 
pendent samples from a cellular telephone frame 
and a landline telephone frame). 
3. Future use of the internet for data collection. Although
the internet presents many coverage and
domain specification challenges, it is worthy of 
consideration because of the potential cost savings 
and ease of data collection and processing. 

4. Improved small area estimation. A national survey 
is supplemented with smaller, localized surveys to 
obtain higher precision in those areas. 

5. Improved estimation for rare populations. A general 
population survey may be supplemented by a survey 
from a frame with a high concentration of members 
of the rare population. 

6. Modular survey design. A multiple frame approach 
can give more flexibility for design of continuing 
surveys. As particular frames become less expen- 
sive to sample, the relative allocation of sample size 
to the different frames can be modified. The modu- 
lar approach also allows more flexibility in re- 
sponding to changing needs for data. 


Figure 4 Frame A contains the entire population; frames B and C 
overlap and are both contained in frame A 


The increased flexibility of multiple frame surveys 
comes at the cost of additional complexity, however. 
Information from the surveys must be combined to estimate 
population quantities, and there are many options for esti- 
mators. Section 2 summarizes estimators that have been 
developed for population totals and describes how these 
modify the sampling weights; Sections 3 and 4 discuss 
weight calibration and describe how to use survey software 
packages with multiple frame survey data. Nonsampling 
errors need to be considered in each frame singly, and in 
terms of their effect on estimates calculated from the com- 
bined information. Section 5 discusses effects of non- 
response and mode effects in multiple frame surveys. 

In addition to the nonresponse, undercoverage, and mea- 
surement error problems that plague single frame surveys, 
multiple frame surveys may have domain misclassification. 
The weight modifications for the estimators in Section 2
depend on the domain membership of the observations. If
some observations in domain a are likely to be mistakenly
recorded as belonging to domain ab, estimators may have
substantial bias. We study effects of domain misclassification
in Section 6, and propose a new method for adjusting
for misclassification bias when misclassification probabili- 
ties are known. Finally, Section 7 discusses design issues 
and Section 8 discusses the potential and challenges of mul- 
tiple frame surveys. 


2. Estimators in overlapping 
multiple frame surveys 


In this section we review estimators for the population 
total Y from overlapping multiple frame surveys, along 
with the weight modifications induced by these estimators. 
For simplicity of notation, we concentrate on dual frame 
surveys in Section 2.1, and outline extensions to multiple 
frame surveys in Section 2.2. In a dual frame survey, we can 
write 

$$Y = Y_a + Y_{ab} + Y_b,$$


where $Y_a$ is the total of the population units in domain a,
$Y_{ab}$ is the total of the population units in domain ab, and
$Y_b$ is the total of the population units in domain b. A
special case is estimating the population size
$N = N_a + N_{ab} + N_b$, as discussed in Haines and Pollock (1998). We
discuss estimating population quantities other than totals 
and means, and using data from multiple frame surveys in 
other analyses, in Section 4. 

We first set out some desirable properties for estimators
from multiple frame surveys.


1. An estimator should be approximately unbiased for 
the corresponding finite population quantity. 

2. Estimators should be internally consistent: that is, if
$\hat{Y}_1$ estimates the number of female engineers in the
population, $\hat{Y}_2$ estimates the number of male engineers
in the population, and $\hat{Y}_3$ estimates the total
number of engineers in the population, then we
should have $\hat{Y}_1 + \hat{Y}_2 = \hat{Y}_3$. Internal consistency
preserves multivariate relationships in the data. In
practical terms, internal consistency requires that 
one set of weights be used for all estimates. 

3. An estimator should be efficient, with low mean 
squared error. 

4. An estimator should be of a form that can be 
calculated with standard survey software such as 
SUDAAN or SAS PROC SURVEYMEANS. This 
allows analysts to work with the data without 
having to write and test new software. In practical 
terms, one data file is created from the multiple
frame survey. The file includes one column of 
weights to be used for calculating point estimates, 
and it contains either variables describing the survey 
designs for formula-based variance estimation, or 
columns of replicate weights for replication-based 
variance estimation. 

5. An estimator should, if possible, be robust to non- 
sampling errors that might occur with multiple 
frame surveys. 


2.1 Estimators and weight adjustments for dual 
frame surveys 


Consider the overlapping dual frame survey depicted in
Figure 2, where domain ab is nonempty. A probability
sample $S(A)$ of size $n_A$ is drawn from the $N_A$ units in
frame A, and an independent probability sample $S(B)$ of
size $n_B$ is drawn from the $N_B$ units in frame B. Unit $i$ in
sample $S(A)$ has probability of inclusion $\pi_i^A$ and weight
$w_i^A$, and unit $i$ in sample $S(B)$ has probability of inclusion
$\pi_i^B$ and weight $w_i^B$. The weights may be the inverses of the
inclusion probabilities, or they may be poststratified to agree
with population counts; it is assumed that estimators of
population totals are approximately unbiased.

Then $E[\sum_{i \in S(A)} w_i^A y_i] \approx Y_a + Y_{ab}$ and
$E[\sum_{i \in S(B)} w_i^B y_i] \approx Y_b + Y_{ab}$. Consequently, an estimator
that combines the observations from both surveys with the original weights,
$\sum_{i \in S(A)} w_i^A y_i + \sum_{i \in S(B)} w_i^B y_i$, is biased for the population
total $Y$: its expectation is approximately $Y + Y_{ab}$, so the overlap
domain is counted twice. If the domain means differ, the corresponding
estimator of the population mean may also be biased.

The various estimators for the population total $Y$ that
have been proposed in the literature modify the weights so
that the estimators are approximately unbiased. The modified
weights, shown below for the different estimators, are
of the form $\tilde{w}_i^A = m_i^A w_i^A$ and $\tilde{w}_i^B = m_i^B w_i^B$. The population
total is then estimated by

$$\hat{Y} = \sum_{i \in S(A)} \tilde{w}_i^A y_i + \sum_{i \in S(B)} \tilde{w}_i^B y_i, \qquad (1)$$

and the population mean $\bar{Y}$ is estimated by $\hat{\bar{Y}} = \hat{Y}/\hat{N}$,
where

$$\hat{N} = \sum_{i \in S(A)} \tilde{w}_i^A + \sum_{i \in S(B)} \tilde{w}_i^B.$$

The estimators will be approximately unbiased, then, if
$m_i^A \approx 1$ for $i \in a$, $m_i^B \approx 1$ for $i \in b$, and $m_i^A + m_i^B \approx 1$
for $i \in ab$. All of the estimators reviewed in this section
satisfy the criteria needed for approximate unbiasedness in
the absence of nonsampling errors (see Lohr 2009).


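Fixed weight adjustments. The simplest weight modification
to preserve approximate unbiasedness, described by Hartley
(1962), takes

$$m_{i,\theta}^A = \begin{cases} 1 & \text{if } i \in a \\ \theta & \text{if } i \in ab, \end{cases}
\qquad
m_{i,\theta}^B = \begin{cases} 1 & \text{if } i \in b \\ 1 - \theta & \text{if } i \in ab, \end{cases} \qquad (2)$$

where $\theta \in [0, 1]$. Using the modified weights
$\tilde{w}_i^A = m_{i,\theta}^A w_i^A$ and $\tilde{w}_i^B = m_{i,\theta}^B w_i^B$ in (1), the resulting estimator
$\hat{Y}(\theta)$ can also be written using the estimated domain
totals $\hat{Y}_a^A = \sum_{i \in S(A), i \in a} w_i^A y_i$, $\hat{Y}_{ab}^A = \sum_{i \in S(A), i \in ab} w_i^A y_i$,
$\hat{Y}_{ab}^B = \sum_{i \in S(B), i \in ab} w_i^B y_i$, and $\hat{Y}_b^B = \sum_{i \in S(B), i \in b} w_i^B y_i$. The estimator

$$\hat{Y}(\theta) = \sum_{i \in S(A)} m_{i,\theta}^A w_i^A y_i + \sum_{i \in S(B)} m_{i,\theta}^B w_i^B y_i
= \hat{Y}_a^A + \theta \hat{Y}_{ab}^A + (1 - \theta) \hat{Y}_{ab}^B + \hat{Y}_b^B \qquad (3)$$

thus estimates the domain total $Y_{ab}$ by a weighted average
of the frame A estimator, $\hat{Y}_{ab}^A$, and the frame B estimator,
$\hat{Y}_{ab}^B$.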

For a fixed value of $\theta$, the estimator $\hat{Y}(\theta)$ gives internal
consistency since the same set of adjusted weights is used
for all variables. The estimator is simple to use and
implement. The efficiency of the estimator depends on the
value chosen for $\theta$. Brick et al. (2006) used $\theta = 1/2$ in
their study of a dual frame survey in which frame A was a
landline telephone frame and frame B was a cellular
telephone frame, and the value of $\theta = 1/2$ is frequently
recommended (see, for example, Mecatti 2007). When $\theta = 0$
or 1, the data in the overlap domain from one of the
samples are discarded and the survey becomes a screening
dual frame survey.
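
The fixed weight adjustment can be sketched in a few lines of Python. This is a minimal illustration rather than production survey code: the function name `hartley_fixed_weight_total`, the tuple layout `(y, weight, domain)`, and the toy records are all assumptions introduced here. Units in domain ab receive the factor theta when they come from S(A) and 1 - theta when they come from S(B); all other units keep their original weights.

```python
def hartley_fixed_weight_total(sample_a, sample_b, theta=0.5):
    """Fixed weight dual frame estimate of a population total.

    sample_a : iterable of (y, weight, domain) from frame A, domain in {"a", "ab"}
    sample_b : iterable of (y, weight, domain) from frame B, domain in {"b", "ab"}
    theta    : compositing factor in [0, 1] applied to overlap units from frame A
    """
    if not 0.0 <= theta <= 1.0:
        raise ValueError("theta must lie in [0, 1]")
    total = 0.0
    for y, w, domain in sample_a:
        if domain not in ("a", "ab"):
            raise ValueError("frame A units must be in domain 'a' or 'ab'")
        m = 1.0 if domain == "a" else theta
        total += m * w * y
    for y, w, domain in sample_b:
        if domain not in ("b", "ab"):
            raise ValueError("frame B units must be in domain 'b' or 'ab'")
        m = 1.0 if domain == "b" else 1.0 - theta
        total += m * w * y
    return total


# Hypothetical records: (y, original weight, domain).
s_a = [(2.0, 10.0, "a"), (3.0, 10.0, "ab")]
s_b = [(5.0, 20.0, "b"), (1.0, 20.0, "ab")]

composite = hartley_fixed_weight_total(s_a, s_b, theta=0.5)
screen_keep_a = hartley_fixed_weight_total(s_a, s_b, theta=1.0)
unadjusted = sum(w * y for y, w, _ in s_a) + sum(w * y for y, w, _ in s_b)
```

In this toy example the unadjusted sum counts the overlap domain twice, while the composite estimate averages the two overlap estimates; setting theta to 1 discards the overlap data from S(B), as in a screening design.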


Optimal estimators. Hartley (1962, 1974) proposed
choosing $\theta$ in (3) so that the variance of $\hat{Y}(\theta)$ would be
minimized. The optimizing value of $\theta$ is

$$\theta_{\mathrm{opt}} = \frac{V(\hat{Y}_{ab}^B) + \mathrm{Cov}(\hat{Y}_b^B, \hat{Y}_{ab}^B) - \mathrm{Cov}(\hat{Y}_a^A, \hat{Y}_{ab}^A)}{V(\hat{Y}_{ab}^A) + V(\hat{Y}_{ab}^B)}.$$

Since the variances and covariances are generally unknown,
they must be estimated from the data, giving

$$\hat{\theta}_H = \frac{\hat{V}(\hat{Y}_{ab}^B) + \widehat{\mathrm{Cov}}(\hat{Y}_b^B, \hat{Y}_{ab}^B) - \widehat{\mathrm{Cov}}(\hat{Y}_a^A, \hat{Y}_{ab}^A)}{\hat{V}(\hat{Y}_{ab}^A) + \hat{V}(\hat{Y}_{ab}^B)}.$$

Skinner and Rao (1996) showed that Hartley's estimator
can be calculated using adjusted weights. The weight
modifications for Hartley's estimator $\hat{Y}(\hat{\theta}_H)$ are given by
(2), substituting $\hat{\theta}_H$ for $\theta$. Since $\hat{\theta}_H$ is consistent for $\theta_{\mathrm{opt}}$,
Hartley's estimator is asymptotically optimal among all
estimators of the form $\hat{Y}_a^A + \hat{Y}_b^B + \theta \hat{Y}_{ab}^A + (1 - \theta) \hat{Y}_{ab}^B$. The
modified weights $\tilde{w}_{i,H}^A$ and $\tilde{w}_{i,H}^B$ are functions of the
variances and covariances of estimated domain totals,
however. This has two consequences: (1) the modified
weights are random variables, and their variability needs to
be accounted for in standard errors of estimators, and (2) the
optimal weight modifications will differ for different
response variables, leading to internal inconsistency.
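
The optimizing value of the compositing factor follows from a short calculation, sketched here under the assumption stated in Section 2.1 that S(A) and S(B) are independent, so that covariances between frame A and frame B estimators vanish. The notation (hats for estimated domain totals, superscripts for the originating frame) is the notation assumed in this sketch.

```latex
\begin{align*}
V[\hat{Y}(\theta)]
  &= V(\hat{Y}_a^A) + \theta^2 V(\hat{Y}_{ab}^A)
     + 2\theta\,\mathrm{Cov}(\hat{Y}_a^A, \hat{Y}_{ab}^A) \\
  &\quad + V(\hat{Y}_b^B) + (1-\theta)^2 V(\hat{Y}_{ab}^B)
     + 2(1-\theta)\,\mathrm{Cov}(\hat{Y}_b^B, \hat{Y}_{ab}^B), \\
\frac{d}{d\theta} V[\hat{Y}(\theta)]
  &= 2\theta V(\hat{Y}_{ab}^A) + 2\,\mathrm{Cov}(\hat{Y}_a^A, \hat{Y}_{ab}^A)
     - 2(1-\theta) V(\hat{Y}_{ab}^B) - 2\,\mathrm{Cov}(\hat{Y}_b^B, \hat{Y}_{ab}^B) = 0, \\
\theta_{\mathrm{opt}}
  &= \frac{V(\hat{Y}_{ab}^B) + \mathrm{Cov}(\hat{Y}_b^B, \hat{Y}_{ab}^B)
           - \mathrm{Cov}(\hat{Y}_a^A, \hat{Y}_{ab}^A)}
          {V(\hat{Y}_{ab}^A) + V(\hat{Y}_{ab}^B)}.
\end{align*}
```

Because every term involves the response variable, a different response generally gives a different optimizing value, which is the source of the internal inconsistency noted above.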
Fuller and Burmeister (1972) proposed modifying
Hartley's estimator by using additional information about
$N_{ab}$, giving

$$\hat{Y}_{FB}(\boldsymbol{\beta}) = \hat{Y}_a^A + \hat{Y}_b^B + \beta_1 \hat{Y}_{ab}^A + (1 - \beta_1) \hat{Y}_{ab}^B + \beta_2 (\hat{N}_{ab}^A - \hat{N}_{ab}^B).$$

As with Hartley's estimator, the optimal values $\beta_{1,\mathrm{opt}}$ and
$\beta_{2,\mathrm{opt}}$ are chosen to minimize the variance of $\hat{Y}_{FB}(\boldsymbol{\beta})$, and
are thus functions of the covariances of the domain totals.
Substituting consistent estimators $\hat{\beta}_1$ and $\hat{\beta}_2$ gives the
weight adjustments for $w_i^A$ and $w_i^B$. Lohr and Rao (2000)
showed that the Fuller-Burmeister estimator $\hat{Y}_{FB}$ has the
smallest asymptotic variance among the estimators considered.
As with the Hartley estimator, however, the modified
weights are random variables that differ for different
responses, and in complex sampling designs the
Fuller-Burmeister estimator is also internally inconsistent.


Pseudo-maximum likelihood (PML) estimators. To achieve 
internal consistency Skinner and Rao (1996) proposed a 
pseudo-maximum likelihood (PML) estimator that uses the 
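same weights for all variables. When $N_{ab}$ is unknown, it is
estimated by $\hat{N}_{ab}^{PML}(\theta)$, which is the smaller of the roots of
the quadratic equation

$$\left[ \frac{\theta}{N_B} + \frac{1 - \theta}{N_A} \right] x^2
- \left[ 1 + \theta \frac{\hat{N}_{ab}^A}{N_B} + (1 - \theta) \frac{\hat{N}_{ab}^B}{N_A} \right] x
+ \theta \hat{N}_{ab}^A + (1 - \theta) \hat{N}_{ab}^B = 0.$$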


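Skinner and Rao (1996) suggested using the value $\theta_P$ for
$\theta$ that minimizes the asymptotic variance of $\hat{N}_{ab}^{PML}(\theta)$:

$$\theta_P = \frac{N_a N_B V(\hat{N}_{ab}^B)}{N_a N_B V(\hat{N}_{ab}^B) + N_b N_A V(\hat{N}_{ab}^A)}. \qquad (4)$$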


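Substituting an estimator $\hat{\theta}_P$ for $\theta_P$, the weight adjustments
are:

$$m_{i,P}^A = \begin{cases}
\dfrac{N_A - \hat{N}_{ab}^{PML}(\hat{\theta}_P)}{\hat{N}_a^A} & \text{if } i \in a \\[2ex]
\dfrac{\hat{N}_{ab}^{PML}(\hat{\theta}_P)}{\hat{\theta}_P \hat{N}_{ab}^A + (1 - \hat{\theta}_P) \hat{N}_{ab}^B}\, \hat{\theta}_P & \text{if } i \in ab,
\end{cases}$$

$$m_{i,P}^B = \begin{cases}
\dfrac{N_B - \hat{N}_{ab}^{PML}(\hat{\theta}_P)}{\hat{N}_b^B} & \text{if } i \in b \\[2ex]
\dfrac{\hat{N}_{ab}^{PML}(\hat{\theta}_P)}{\hat{\theta}_P \hat{N}_{ab}^A + (1 - \hat{\theta}_P) \hat{N}_{ab}^B}\, (1 - \hat{\theta}_P) & \text{if } i \in ab.
\end{cases}$$
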
If the value of $\theta_P$ cannot be estimated, for example if the
two sampling frames coincide or the design in Figure 1 is
used, then one can use an average design effect from each
survey in the adjustment, as described in Lohr and Rao
(2006). The PML estimator is internally consistent; while
not guaranteed to give the smallest mean squared error, it
has high efficiency in many survey situations.
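
The overlap-size step of the PML estimator can be sketched numerically. The Python below is an illustration under stated assumptions: it takes the quadratic to have leading coefficient theta/N_B + (1 - theta)/N_A, linear coefficient -[1 + theta * Nab_A/N_B + (1 - theta) * Nab_B/N_A], and constant term theta * Nab_A + (1 - theta) * Nab_B, where Nab_A and Nab_B are the estimates of the overlap size from S(A) and S(B). The function name `pml_overlap_size` and the numerical inputs are hypothetical.

```python
import math


def pml_overlap_size(nab_hat_a, nab_hat_b, frame_size_a, frame_size_b, theta):
    """Smaller root of the PML quadratic for the overlap domain size N_ab.

    nab_hat_a, nab_hat_b       : estimates of N_ab from samples S(A) and S(B)
    frame_size_a, frame_size_b : known frame sizes N_A and N_B
    theta                      : compositing factor in [0, 1]
    """
    if not 0.0 <= theta <= 1.0:
        raise ValueError("theta must lie in [0, 1]")
    # Coefficients of a*x^2 + b*x + c = 0 (assumed form, see lead-in).
    a = theta / frame_size_b + (1.0 - theta) / frame_size_a
    b = -(1.0 + theta * nab_hat_a / frame_size_b
          + (1.0 - theta) * nab_hat_b / frame_size_a)
    c = theta * nab_hat_a + (1.0 - theta) * nab_hat_b
    disc = b * b - 4.0 * a * c
    if disc < 0:
        raise ValueError("quadratic has no real roots for these inputs")
    # a > 0, so subtracting the square root gives the smaller root.
    return (-b - math.sqrt(disc)) / (2.0 * a)


# Hypothetical inputs: the two samples agree on the overlap size.
agree = pml_overlap_size(200.0, 200.0, 1000.0, 500.0, theta=0.5)
# Hypothetical inputs: the two samples disagree.
differ = pml_overlap_size(180.0, 220.0, 1000.0, 500.0, theta=0.5)
```

When both samples give the same overlap estimate, the smaller root reproduces that common value; when they disagree, it falls between the two in this example.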


Single frame estimators. Bankier (1986) and Kalton and
Anderson (1986) proposed estimators of the form in (1) that 
treat all the observations as though they had been sampled 
from one frame, with adjusted weights in the intersection 
domain relying on the inclusion probabilities for each frame. 
The weight adjustments for the Kalton and Anderson (1986) 
single frame estimator are: 


$$m_{i,S}^A = \begin{cases}
1 & \text{if } i \in a \\
w_i^B / (w_i^A + w_i^B) & \text{if } i \in ab,
\end{cases}
\qquad
m_{i,S}^B = \begin{cases}
1 & \text{if } i \in b \\
w_i^A / (w_i^A + w_i^B) & \text{if } i \in ab.
\end{cases}$$


If $w_i^A = 1/\pi_i^A$ and $w_i^B = 1/\pi_i^B$, the single frame
estimator uses $\tilde{w}_{i,S}^A = \tilde{w}_{i,S}^B = 1/(\pi_i^A + \pi_i^B)$ for units in
ab. The weight adjustment in domain ab relies on both
$\pi_i^A$ and $\pi_i^B$. Thus if a disproportionate stratified random
sample is taken from frame B, one must know the frame-B
stratum membership for units sampled in $S(A)$. The
adjusted weights from the single frame estimator can be
interpreted in terms of inclusion probabilities for sampled
units. If the sampling fractions are small, $\tilde{w}_{i,S}$ is approximately
$1/P$(unit $i$ is included in one of the samples). If
each of $S(A)$ and $S(B)$ is self-weighting, then the single
frame estimator reduces to (3).
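
A short numerical check of the single frame adjustment, written as a Python sketch: for a unit in the overlap domain, multiplying either original weight by its adjustment factor should give 1/(pi_A + pi_B). The function name `kalton_anderson_adjustments` and the inclusion probabilities are assumptions chosen for illustration.

```python
def kalton_anderson_adjustments(w_a, w_b):
    """Single frame weight adjustments (m_A, m_B) for a unit in domain ab.

    w_a, w_b : the unit's original weights under the frame A and frame B designs
    """
    if w_a <= 0 or w_b <= 0:
        raise ValueError("weights must be positive")
    total = w_a + w_b
    # A unit sampled from A is scaled by w_B / (w_A + w_B), and vice versa.
    return w_b / total, w_a / total


# Hypothetical inclusion probabilities for one overlap unit.
pi_a, pi_b = 0.02, 0.08
w_a, w_b = 1.0 / pi_a, 1.0 / pi_b

m_a, m_b = kalton_anderson_adjustments(w_a, w_b)
adjusted_from_a = m_a * w_a  # adjusted weight if the unit was drawn in S(A)
adjusted_from_b = m_b * w_b  # adjusted weight if the unit was drawn in S(B)
```

Both adjusted weights coincide, which is why the unit's frame B inclusion probability must be known even when it was sampled from frame A.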

The single frame weight modifications are the same for 
all response variables, so estimators are internally con- 
sistent. For complex surveys, however, single frame esti- 
mators may not be as efficient as the optimal or PML esti- 
mators. Their performance may be improved by raking 
toward the frame population totals (Skinner 1991). 


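Pseudo-empirical likelihood (PEL) estimators. Rao and Wu
(2010) proposed empirical likelihood estimators for dual
frame surveys. Using $\theta = \theta_P$, the empirical log likelihood
function is defined by

$$\begin{aligned}
\ell(p_a, p_{ab}^A, p_{ab}^B, p_b) ={}&
\frac{N_a}{\hat{N}_a^A} \sum_{i \in S(A),\, i \in a} w_i^A \log(p_{ai})
+ \frac{\theta_P N_{ab}}{\hat{N}_{ab}^A} \sum_{i \in S(A),\, i \in ab} w_i^A \log(p_{abi}^A) \\
&+ \frac{N_b}{\hat{N}_b^B} \sum_{i \in S(B),\, i \in b} w_i^B \log(p_{bi})
+ \frac{(1 - \theta_P) N_{ab}}{\hat{N}_{ab}^B} \sum_{i \in S(B),\, i \in ab} w_i^B \log(p_{abi}^B),
\end{aligned}$$

where $\theta_P$ is given in (4). An estimator $\hat{\theta}_P$ is substituted if
$\theta_P$ is unknown. Then $\ell(p_a, p_{ab}^A, p_{ab}^B, p_b)$ is maximized
subject to

$$\sum_{i \in S(A),\, i \in a} p_{ai} = 1, \quad
\sum_{i \in S(A),\, i \in ab} p_{abi}^A = 1, \quad
\sum_{i \in S(B),\, i \in b} p_{bi} = 1, \quad
\sum_{i \in S(B),\, i \in ab} p_{abi}^B = 1,$$

and

$$\sum_{i \in S(A),\, i \in ab} p_{abi}^A y_i = \sum_{i \in S(B),\, i \in ab} p_{abi}^B y_i. \qquad (5)$$


When $N_{ab}$ is unknown, the PEL weight modifications are

$$m_{i,PEL}^A = \begin{cases}
\dfrac{\hat{p}_{ai}}{w_i^A} \left[ N_A - \hat{N}_{ab}^{PML}(\hat{\theta}_P) \right] & \text{if } i \in a \\[2ex]
\hat{\theta}_P \dfrac{\hat{p}_{abi}^A}{w_i^A}\, \hat{N}_{ab}^{PML}(\hat{\theta}_P) & \text{if } i \in ab,
\end{cases}$$

$$m_{i,PEL}^B = \begin{cases}
\dfrac{\hat{p}_{bi}}{w_i^B} \left[ N_B - \hat{N}_{ab}^{PML}(\hat{\theta}_P) \right] & \text{if } i \in b \\[2ex]
(1 - \hat{\theta}_P) \dfrac{\hat{p}_{abi}^B}{w_i^B}\, \hat{N}_{ab}^{PML}(\hat{\theta}_P) & \text{if } i \in ab.
\end{cases}$$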

The constraint in (5) changes the weights in the overlap
domain so that the estimator of $Y_{ab}$ from $S(A)$ is forced to
equal the estimator of $Y_{ab}$ from $S(B)$. This constraint,
however, results in a different set of weights for each
response variable. The PEL estimator thus is not internally
consistent. Rao and Wu (2010) presented an alternative
multiplicity version in which the weight adjustments do not
depend on $y$; in the absence of auxiliary information, this
estimator is the same as $\hat{Y}(1/2)$ in (3).


2.2 Weight adjustments with three or more frames 


In the general case, suppose there are $Q$ frames, denoted
$A_1, \ldots, A_Q$. Let $S(A_q)$ denote the probability sample from
frame $A_q$, for $q = 1, \ldots, Q$. Unit $i$ in sample $S(A_q)$ has
probability of inclusion $\pi_i^{(q)}$ and weight $w_i^{(q)}$. There are
a total of $D$ distinct domains.

A multiple frame estimator generalizing (1) is of the form

$$\hat{Y} = \sum_{q=1}^{Q} \sum_{i \in S(A_q)} m_i^{(q)} w_i^{(q)} y_i,$$

where $m_i^{(q)}$ is the weight adjustment for observation
$i$ in $S(A_q)$. A fixed weight estimator sets weight adjustments
$m^{(q,d)}$ for each frame and domain, with the
constraints that $m^{(q,d)} \ge 0$ ($m^{(q,d)}$ is assumed to equal 0
if domain $d$ is not part of frame $A_q$) and $\sum_{q=1}^{Q} m^{(q,d)} = 1$
for $d = 1, \ldots, D$; then $m_i^{(q)} = m^{(q,d)}$ if observation
$i$ from $S(A_q)$ is in domain $d$. A simple choice,
which generalizes the fixed weight dual frame estimator
$\hat{Y}(1/2)$ in (3), takes $m^{(q,d)} = 1/[\text{number of frames that
contain domain } d]$ for each frame containing domain $d$; this is
called the multiplicity estimator by Mecatti (2007). Other choices
include setting $m^{(q,d)} = 1$ in exactly one frame and 0 for the other
frames, resulting in screening estimators.
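
The multiplicity choice of fixed weight adjustments can be sketched as follows in Python, using the three-frame layout with seven domains as the example. The function name `multiplicity_adjustments`, the dictionary representation of frames, and the domain labels are assumptions made for this illustration.

```python
def multiplicity_adjustments(frame_domains):
    """Multiplicity weight adjustments for a multiple frame design.

    frame_domains : dict mapping a frame label to the set of domain labels
                    that the frame contains
    Returns a dict mapping (frame, domain) to 1 / (number of frames
    containing that domain); pairs not in the dict have adjustment 0.
    """
    counts = {}
    for domains in frame_domains.values():
        for d in domains:
            counts[d] = counts.get(d, 0) + 1
    return {
        (frame, d): 1.0 / counts[d]
        for frame, domains in frame_domains.items()
        for d in domains
    }


# Hypothetical three-frame design in which all frames are incomplete.
frames = {
    "A": {"a", "ab", "ac", "abc"},
    "B": {"b", "ab", "bc", "abc"},
    "C": {"c", "ac", "bc", "abc"},
}
m = multiplicity_adjustments(frames)

# For every domain, the adjustments summed over frames equal 1.
domain_sums = {}
for (frame, d), value in m.items():
    domain_sums[d] = domain_sums.get(d, 0.0) + value
```

A unit in domain abc sampled from any of the three frames has its weight multiplied by 1/3, while a unit in domain a keeps its original weight.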

Many of the properties from the dual frame situation 
extend to the case of three or more frames; multiple frame 
versions of the estimators in Section 2.1 were studied by 
Hartley (1974), Lohr and Rao (2006), and Mecatti (2007). 
How do the multiple frame estimators satisfy the criteria set 
out at the beginning of this section? All of the estimators
(fixed weight, optimal, PML, PEL, and single frame) are
approximately unbiased for population totals when sufficiently
large samples are taken in the frames. The fixed
weight, PML, and single frame estimators are internally 
consistent; the optimal Hartley-type and Fuller-Burmeister- 
type estimators in Lohr and Rao (2006) and a multiple- 
frame extension of the PEL estimator of Rao and Wu (2010) 
are not internally consistent. While the optimal estimators 
are asymptotically efficient, they are often unstable in small 
or moderate samples with three or more frames because the 
optimal estimated weight modifications are functions of 
large estimated covariance matrices. The optimal and PEL 
estimators are ill suited for use with standard survey soft- 
ware because they require a different set of weights for each 
response variable. 

We recommend that one of the internally consistent
estimators (fixed weight, PML, or single frame) be used in
practice. Lohr and Rao (2006) concluded that the PML 
estimator has small mean squared error in many survey 
circumstances, and thus is a good choice for a survey that is 
conducted only once. With repeated surveys, though, the 
simplicity and transparency of a fixed weight estimator may 
be preferred. Fixed weight adjustments may make year-to- 
year comparisons easier in an annual survey where the 
domain proportions are relatively constant over time. Fixed 
weight estimators are also more amenable to weight 
adjustments for nonresponse and domain misclassification 
(see Sections 5.1 and 6.1). If fixed weight adjustments can 
be chosen that are close to the optimal weight adjustments 
for important responses, perhaps by using estimated design 
effects from previous surveys, the fixed weight estimator 
will have mean squared error close to that of the optimal and 
PML estimators. 


3. Poststratification to population controls


All of the estimators in Section 2 modify the original 
sampling weights. As a result, some properties of the 
original weights may be lost. For example, if a stratified 
random sample is taken in frame A, the modified weights 
will not necessarily have the property that the sum of the 
weights in a stratum equals the stratum population size. 

Bankier (1986), in the original development of single
frame estimation methods, suggested raking the single
frame weights, $\tilde{w}_{i,S}^A$ and $\tilde{w}_{i,S}^B$, to stratum totals so that the
adjusted weights $\tilde{w}_i^{A,\mathrm{adj}}$ and $\tilde{w}_i^{B,\mathrm{adj}}$ satisfy

$$\sum_{i \in S_{Ah} \cap S(A)} \tilde{w}_i^{A,\mathrm{adj}} + \sum_{i \in S_{Ah} \cap S(B)} \tilde{w}_i^{B,\mathrm{adj}} = N_{Ah},$$

where $S_{Ah}$ represents the sampled units from either frame in
stratum $h$ of frame A, and $N_{Ah}$ is the population size of
that stratum. Bankier (1986) and Skinner (1991) used raking
ratio estimation to calibrate single frame estimators to the
frame population sizes $N_A$ and $N_B$. Kott, Amrhein and
Hicks (1998) proposed using the least squares calibration
methods of Deville and Särndal (1992) for calibrating
weights to population totals such as stratum sizes.

For the PML estimator, Lohr and Rao (2000) recom- 
mended combining the samples first and then using calibra- 
tion methods to adjust to population as well as separate- 
frame population totals. When nonresponse is present and a 
fixed weight estimator is used, Brick, Cervantes, Lee and 
Norman (2011) concluded that it is preferable to poststratify 
the individual samples first, and then combine the samples. 
In some situations, it is most efficient to poststratify both 
before and after combining samples; in other situations, 
poststratification can increase bias (see Section 6). Deci- 
sions about poststratification need to be made based on the 
mean squared error, which includes effects of nonsampling 
errors, and not just on the sampling variance. 


4. Analyzing multiple frame surveys 
with survey software 


4.1 Point estimation with survey software 


Only internally consistent weight adjustments are suitable
for use with survey software when there are multiple
responses of interest. Each of the internally consistent
methods presented in Section 2.1 results in one vector of
adjusted weights for each sample. These may then be
concatenated to form one vector of weights:
$\tilde{\mathbf{w}} = [\tilde{w}_i, i \in S(A_1); \ldots; \tilde{w}_i, i \in S(A_Q)]$.
Let $\mathbf{y}$ be the corresponding vector of observations, formed by concatenating
the observations from samples $S(A_1)$ through $S(A_Q)$.
Then $\hat{Y} = \tilde{\mathbf{w}}' \mathbf{y}$. From a user's perspective, once the
modified weights are constructed, the procedure followed to
find point estimates of population totals and means is the
same as in a single frame survey.

The modified weights from an internally consistent 
procedure can be used to estimate any population quantity. 
Let $F(y)$ be the cumulative distribution function for the
population, with

$$F(y) = \sum_{i=1}^{N} I(y_i \le y) / N,$$

where $I(y_i \le y) = 1$ if $y_i \le y$ and 0 otherwise. In a
single frame survey, $F(y)$ is estimated by the empirical
cumulative distribution function

$$\hat{F}(y) = \frac{\sum_{i \in S} w_i I(y_i \le y)}{\sum_{i \in S} w_i}.$$

The modified weights may be used to estimate $F(y)$ in a
multiple frame survey:

$$\hat{F}(y) = \frac{\sum_{q=1}^{Q} \sum_{i \in S(A_q)} \tilde{w}_i^{A_q} I(y_i \le y)}{\sum_{q=1}^{Q} \sum_{i \in S(A_q)} \tilde{w}_i^{A_q}}.$$

The denominator is approximately unbiased for $N$, and the
numerator is approximately unbiased for $\sum_{i=1}^{N} I(y_i \le y)$.
Any functional of the cumulative distribution function may
then be estimated using $\hat{F}(y)$: the mean, $\int y \, d\hat{F}(y)$, the
median $m$ satisfying $\hat{F}(m) \approx 1/2$, or any other quantity.

Since the estimators with modified weights are approxi- 
mately unbiased for population means and totals, they are 
also approximately unbiased for smooth functions of popu- 
lation means such as ratios and regression coefficients. Any 
population quantity that could be estimated using the 
weights from a single frame survey can be estimated analo- 
gously using the adjusted weight vector for the multiple 
frame survey. 
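As an illustration of Section 4.1, the following minimal sketch concatenates two samples and their adjusted weights and then computes a total, a mean, and a median from the weighted empirical distribution function. The data values and the function names (`concat`, `weighted_cdf`, `weighted_quantile`) are hypothetical and chosen only for this example; the weights are assumed to have already been adjusted for multiplicity by an internally consistent method.

```python
from typing import Sequence


def concat(*parts: Sequence[float]) -> list[float]:
    """Concatenate per-sample vectors into one combined vector."""
    return [value for part in parts for value in part]


def weighted_cdf(y: Sequence[float], w: Sequence[float], t: float) -> float:
    """Estimate F(t) as sum(w_i * I(y_i <= t)) / sum(w_i)."""
    return sum(wi for yi, wi in zip(y, w) if yi <= t) / sum(w)


def weighted_quantile(y: Sequence[float], w: Sequence[float], p: float) -> float:
    """Smallest observed value whose estimated CDF reaches p."""
    total = sum(w)
    running = 0.0
    for yi, wi in sorted(zip(y, w)):
        running += wi
        if running >= p * total:
            return yi
    return max(y)


# Hypothetical data: observations and adjusted weights from S(A) and S(B).
y_a, w_a = [1.0, 4.0, 2.0], [10.0, 5.0, 5.0]
y_b, w_b = [3.0, 5.0], [5.0, 15.0]

y = concat(y_a, y_b)
w = concat(w_a, w_b)

total_hat = sum(wi * yi for wi, yi in zip(w, y))  # estimated population total
mean_hat = total_hat / sum(w)                     # estimated population mean
median_hat = weighted_quantile(y, w, 0.5)         # estimated median
```

Once the weights are concatenated, nothing in the point estimation code refers to the frames, which is the sense in which the procedure matches a single frame survey.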


4.2 Variance estimation with survey software 


Knowledge of the survey designs is needed to calculate 
standard errors. Variance estimation is straightforward for 
the estimator in (3), where the weight adjustments do not 
depend on the data. In that situation, 


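$$V[\hat{Y}(\theta)] = V\Big( \sum_{i \in S(A)} \tilde{w}_i^{A} y_i \Big) + V\Big( \sum_{i \in S(B)} \tilde{w}_i^{B} y_i \Big),$$

where $\tilde{w}_i^{A}$ and $\tilde{w}_i^{B}$ are defined below (2). Create the data
set by concatenating the observations from S(A) and S(B)
as in Section 4.1, using $\tilde{w}_i^{A}$ and $\tilde{w}_i^{B}$ as the weights. Define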
the stratification variable for the combined sample as the 
combination of categories given by the frame indicator 
variable, the frame-A stratification variable, and the frame- 
B stratification variable. Define the first-stage clustering 
variable for the combined sample similarly as the combina- 
tion of categories of the individual frame clustering vari- 
ables. Then, standard survey software may be used to esti- 
mate population means and totals using the modified 
weights, and to estimate variances using the stratification 
and clustering variables from the combined samples. 
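The recipe above can be sketched as follows. This is a minimal illustration, not the procedure of any particular survey package: it assumes independent samples, uses the with-replacement (ultimate cluster) variance formula without finite population corrections, and takes hypothetical records of the form (frame, stratum, psu, modified weight, y). The combined stratum is the pair (frame, stratum), so strata that share a label in different frames are kept apart.

```python
from collections import defaultdict


def total_and_variance(records):
    """records: iterable of (frame, stratum, psu, weight, y).

    Returns the weighted total and a with-replacement variance estimate,
    treating (frame, stratum) as the combined stratum and the psu within
    that combined stratum as the first-stage cluster.
    """
    psu_totals = defaultdict(lambda: defaultdict(float))
    for frame, stratum, psu, weight, y in records:
        psu_totals[(frame, stratum)][psu] += weight * y

    total = 0.0
    variance = 0.0
    for totals in psu_totals.values():
        z = list(totals.values())
        n = len(z)
        total += sum(z)
        if n > 1:  # a stratum with a single psu adds no estimated variance here
            zbar = sum(z) / n
            variance += n / (n - 1) * sum((zi - zbar) ** 2 for zi in z)
    return total, variance


# Hypothetical combined data set; weights already include the multiplicity
# adjustment. Both frames use the stratum label 1 on purpose.
records = [
    ("A", 1, 1, 2.0, 1.0),
    ("A", 1, 2, 2.0, 3.0),
    ("B", 1, 1, 1.0, 4.0),
    ("B", 1, 2, 1.0, 2.0),
    ("B", 1, 3, 1.0, 6.0),
]
total_hat, var_hat = total_and_variance(records)
```

Because the two samples are independent, the estimated variance is the sum of the frame A and frame B contributions, mirroring the decomposition of the variance of the fixed weight estimator.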

Variance estimation is more complicated when the
weight modifications $m^{A}$ or $m^{B}$ depend on quantities that
are estimated from the sample, as in the PML estimator, or
when the combined sample is poststratified or calibrated to
population quantities. Linearization, jackknife, and bootstrap
methods may then be used to estimate variances.

In the following, we summarize methods that can be used
for variance estimation if the psus from the frames are 
selected independently. If samples from the different frames 
share psus, other methods must be used. If, for example, 
psus are selected from the population, and a dual frame 
design is used within each selected psu, point estimators for 
psu totals can be calculated using one of the methods 
described in Section 2. Then standard replication methods 
can be used to calculate a with-replacement variance esti- 
mator. 

Under regularity conditions, the linearization and jack- 
knife methods are consistent for estimating the variance of a 
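population characteristic $t$ that can be written as
$t = g(\mathbf{A}, \mathbf{B})$, where $\mathbf{A}$ is a vector of population totals from
frame A, $\mathbf{B}$ is a vector of population totals from frame B,
and $g$ is a twice continuously differentiable function
(Skinner and Rao 1996; Lohr and Rao 2000). The vector $\mathbf{A}$
is estimated from S(A) by $\hat{\mathbf{A}}$, with estimated covariance
matrix $\hat{\Sigma}_A$; similarly, $\hat{\mathbf{B}}$ estimates $\mathbf{B}$ from S(B), with
$\hat{V}(\hat{\mathbf{B}}) = \hat{\Sigma}_B$. The linearization estimator of the variance of
$\hat{t} = g(\hat{\mathbf{A}}, \hat{\mathbf{B}})$ is

$$\hat{V}_L(\hat{t}) = \hat{g}_A' \hat{\Sigma}_A \hat{g}_A + \hat{g}_B' \hat{\Sigma}_B \hat{g}_B,$$

where $\hat{g}_A$ is the vector of partial derivatives of $g(\mathbf{A}, \mathbf{B})$
with respect to the components of $\mathbf{A}$, evaluated at the estimates, and $\hat{g}_B$ is the
corresponding vector of partial derivatives for frame B.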
Demnati, Rao, Hidiroglou and Tambay (2007) derived 
linearization estimators of the variance for multiple frame 
surveys by taking derivatives of a function of the weights 
rather than of the means. Linearization methods require that 
the derivatives be calculated separately for each estimator 
that is considered, and these calculations can be cumber- 
some. For that reason, it may be preferred to use replication 
methods if multiple frame surveys are adopted. 
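To make the linearization formula concrete, the sketch below evaluates the two quadratic forms for a ratio of combined totals. The function names, the choice of a ratio as $g$, and all numbers are assumptions made for illustration; the gradient is derived by hand for this particular $g$, which is exactly the step that has to be redone for every new estimator.

```python
def quad_form(g, sigma):
    """Return g' * sigma * g for a vector g and a square matrix sigma."""
    size = len(g)
    return sum(g[r] * sigma[r][c] * g[c] for r in range(size) for c in range(size))


def linearization_variance(grad_a, sigma_a, grad_b, sigma_b):
    """Linearization variance of g(A_hat, B_hat) for independent frame samples."""
    return quad_form(grad_a, sigma_a) + quad_form(grad_b, sigma_b)


# Hypothetical estimated totals and covariance matrices from the two samples.
a_hat, b_hat = [60.0, 30.0], [40.0, 20.0]
sigma_a = [[4.0, 1.0], [1.0, 2.0]]
sigma_b = [[9.0, 0.0], [0.0, 1.0]]

# Illustrative g: the ratio t = (A1 + B1) / (A2 + B2).
num = a_hat[0] + b_hat[0]
den = a_hat[1] + b_hat[1]
t_hat = num / den

# For this g the partial derivatives are the same for the A and B components.
grad = [1.0 / den, -num / den ** 2]
v_lin = linearization_variance(grad, sigma_a, grad, sigma_b)
```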

Suppose a stratified multistage sample with $H$ strata is
taken from frame A, where stratum $h$ has $n_h^A$ primary
sampling units. An independent stratified multistage sample
with $L$ strata is taken from frame B, where stratum $l$ has
$n_l^B$ primary sampling units. The jackknife estimator of the
variance can be calculated by creating a total of
$\sum_{h=1}^{H} n_h^A + \sum_{l=1}^{L} n_l^B$ replicate weight columns (Lohr and Rao 2000).
The replicate weights for the column corresponding to the
deletion of psu $i$ from stratum $h$ in S(A) are formed by:

$$\tilde{w}_{k(hi)}^{A} = \begin{cases} \dfrac{n_h^A}{n_h^A - 1}\, \tilde{w}_k^{A} & \text{if unit } k \text{ is in stratum } h \text{ but not in psu } i, \\ 0 & \text{if unit } k \text{ is in psu } i \text{ of stratum } h, \\ \tilde{w}_k^{A} & \text{if unit } k \text{ is in stratum } g \ne h. \end{cases}$$

The jackknife coefficient for this column is the multiplier
$(n_h^A - 1)/n_h^A$. The column of replicate weights corresponding
to the deletion of psu $j$ from stratum $l$ in S(B) is
formed similarly, with jackknife coefficient $(n_l^B - 1)/n_l^B$.
With more than two frames, additional columns of replicate 
weights are added corresponding to the deleted psus from 
those samples. Weights for a bootstrap method of variance 
estimation (see Lohr 2007) can be defined similarly. 
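The construction of the replicate weight columns can be sketched as below. This is an illustrative implementation under stated assumptions: every stratum has at least two psus, the weights of units from the other frame are left unchanged in each column, and the dictionary keys, function names, and data are hypothetical.

```python
def jackknife_replicates(units):
    """units: list of dicts with keys frame, stratum, psu, weight.

    Returns one (coefficient, replicate_weights) pair per psu in either
    sample. Deleting a psu zeroes its units, inflates the other units of the
    same stratum by n_h / (n_h - 1), and leaves all other units unchanged.
    """
    psus = {}
    for u in units:
        psus.setdefault((u["frame"], u["stratum"]), set()).add(u["psu"])

    replicates = []
    for (frame, stratum), ids in sorted(psus.items()):
        n_h = len(ids)  # assumed to be at least 2
        for deleted in sorted(ids):
            weights = []
            for u in units:
                if (u["frame"], u["stratum"]) != (frame, stratum):
                    weights.append(u["weight"])
                elif u["psu"] == deleted:
                    weights.append(0.0)
                else:
                    weights.append(u["weight"] * n_h / (n_h - 1))
            replicates.append(((n_h - 1) / n_h, weights))
    return replicates


def jackknife_variance(units, y, estimator):
    """Sum of coefficient * (replicate estimate - full estimate)^2."""
    full = estimator([u["weight"] for u in units], y)
    return sum(c * (estimator(w, y) - full) ** 2
               for c, w in jackknife_replicates(units))


def weighted_total(w, y):
    return sum(wi * yi for wi, yi in zip(w, y))


# Hypothetical dual frame sample: 2 psus from frame A and 3 from frame B.
units = [
    {"frame": "A", "stratum": 1, "psu": 1, "weight": 2.0},
    {"frame": "A", "stratum": 1, "psu": 2, "weight": 2.0},
    {"frame": "B", "stratum": 1, "psu": 1, "weight": 1.0},
    {"frame": "B", "stratum": 1, "psu": 2, "weight": 1.0},
    {"frame": "B", "stratum": 1, "psu": 3, "weight": 1.0},
]
y = [1.0, 3.0, 4.0, 2.0, 6.0]

replicates = jackknife_replicates(units)
v_jk = jackknife_variance(units, y, weighted_total)
```

Passing a different `estimator` function (for example a ratio or a poststratified total recomputed from each replicate column) reuses the same columns, which is the practical appeal of replication methods.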

Multiple frame replication variance methods can be used 
with standard survey packages that allow replicate weights. 
If desired, each column in the replicate weights can be post- 
stratified to population and frame totals, so that the post- 
stratification is accounted for in the variance estimation. 

One challenge with replication variance methods is that 
the number of columns of replicate weights needed may be 
very large if a simple random sample or stratified random 
sample is taken in one of the frames. For the bootstrap, we 
have found that for some surveys at least 500 bootstrap 
iterations are needed for variance estimates with dual frame 
surveys, which again may be excessive. It is possible that 
combined strata variance estimation, as discussed in Lu, 
Brick and Sitter (2006), may be used with multiple frame 
surveys to reduce the number of replicates needed. 


5. Nonsampling errors 


Multiple frame surveys often have better population
coverage than single frame surveys. When all frames are
incomplete, as in Figure 3, any one of frames A, B, or C, if 
used as the sole sampling frame, would have severe 
undercoverage. The multiple frame survey design ensures 
that all units in the overlapping frames have a positive 
probability of inclusion. 

Like all surveys, multiple frame surveys are subject to 
nonsampling errors. They have nonresponse, which may 
differ in the different frames. While the union of the frames 
may have better coverage than a single frame, there may 
still be undercoverage of the target population. Estimators 
for multiple frame surveys are also sensitive to domain 
misclassification and biases that might result from different 
administration methods or modes in the component surveys. 
We discuss nonresponse and mode effects in this section, 
and study effects of domain misclassification in Section 6. 


5.1 Nonresponse 


In any survey, nonresponse can result in biased estimates 
of population totals and other quantities. Different non- 
response rates in the samples from the two frames can affect 
the point estimates of the population total given in Section 
2; additionally, nonresponse can affect the weight adjust- 
ments prescribed by some of the methods. 

Kennedy (2007) discussed a problem that has occurred 
when frame A consists of landline telephone numbers and 
frame B has cellular telephone numbers: the units in the
intersection domain ab who were interviewed by cell
phone differed from those in ab who were interviewed on
the landline phone. For example, it was estimated that 18%
of the intersection units were aged 18-25 in the frame-B
sample, while it was estimated that only 8% of the intersection
units were aged 18-25 using the frame-A sample.
The difference was ascribed to nonresponse: it was thought
that persons who predominantly use cellular telephones (and
thus are difficult to reach through a landline survey) tend to
be younger. Kennedy (2007) suggested raking using estimated
relative telephone usage (i.e., whether most calls
are on a landline or cellular telephone).

Brick et al. (2011) proposed two methods for nonresponse
adjustment in dual frame cellular/landline telephone
surveys with fixed weight estimators. They considered
a setup in which the overlap domain has two groups:
households that receive all or nearly all of their calls on
cellular telephones (cell-mainly), and the remaining households
in the overlap domain (landline-mainly). The first
method, which does not require external estimates of control
totals, sets the value of $\theta$ in the fixed weight adjustment
estimator to reduce the nonresponse bias by using the
response rates for the cell-mainly and landline-mainly 
households in each sample. The second method requires
poststratification control totals for the cell-mainly and
landline-mainly groups in the overlap domain, $N_{1,ab}$ and
$N_{2,ab}$, and estimates the population total in domain ab by

$$\hat{Y}_{ab} = \sum_{g=1}^{2} \left[ \theta_g \frac{N_{g,ab}}{\hat{N}_{g,ab}^{A}} \hat{Y}_{g,ab}^{A} + (1 - \theta_g) \frac{N_{g,ab}}{\hat{N}_{g,ab}^{B}} \hat{Y}_{g,ab}^{B} \right],$$

where $\hat{Y}_{g,ab}^{A}$ represents the estimated total of group $g$ in
domain ab from S(A), the other totals are defined
similarly, and $0 \le \theta_g \le 1$ for $g = 1, 2$.


5.2 Mode effects


In some cases, multiple frame may also mean multiple 
mode. De Leeuw (2008) compared the advantages and 
disadvantages of different sampling modes, and summarized 
empirical research on mode biases. Persons may give 
different responses when presented with questions in a 
visual form than when presented with questions in an 
auditory form, resulting in mode bias. Mode effects that 
occur in single frame surveys will also occur in multiple 
frame surveys. If different modes are used in different 
frames, it is challenging to separate mode effects from other 
nonsampling errors. 

Many of the multiple frame survey estimators combine
estimates from the overlap domains, and these methods
assume that the estimators of $Y_{ab}$ from the component
surveys both estimate the same quantity. If, however, the
frame A survey is conducted in person while the frame B
survey is conducted by telephone, it is possible that a census
of the domain ab from frame B would give a different
domain total than a census from frame A.

One possibility to investigate mode effects is to conduct 
the frame B survey using a split sample, e.g., partly in 
person and partly by telephone, but that would reduce the 
cost savings from the dual frames. Careful pretesting can 
mitigate the mode effects. Research is needed in this area;
the same problem of mode effects, of course, occurs in
single frame surveys such as the American Community
Survey, in which nonresponse follow-up is done by a different
mode than the original sample (see Citro and Kalton 2007).
The methods presented in de Leeuw, Hox and Dillman 
(2008) for designing surveys for multiple modes also apply 
in the multiple frame setting. 

Vannieuwenhuyze, Loosveldt and Molenberghs (2011) 
presented a method for distinguishing mode effects from 
selection effects when a supplemental single-mode survey is 
available. They noted, however, that the method requires the 
strong assumption that the coverage and nonresponse errors 
are equivalent for both surveys. If this assumption is met for 
a dual frame survey so that the samples in the overlap 
domain from frames A and B represent the same population, 
and if domain classification is correct, the mode effect can 
be estimated from the overlap domain as $\hat{D}_{ab} = \hat{Y}_{ab}^{A} - \hat{Y}_{ab}^{B}$.
A difference that is significantly different from 0 indicates
the presence of a mode effect if there are no other nonsampling
errors. If other nonsampling errors are present, a large value
of $\hat{D}_{ab}$ does not provide information about the cause of the
difference; experimentation is needed to distinguish possible
causes.
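One simple way to judge whether the overlap-domain difference is far from 0 is to divide it by its estimated standard error. The sketch below is an illustration rather than a method taken from the cited papers: it assumes the two samples are independent, so that the variances of the two overlap-domain estimates add, and the function name and all input values are hypothetical.

```python
import math


def mode_effect_z(y_ab_a, var_a, y_ab_b, var_b):
    """Return (d_ab, z): the difference between the frame A and frame B
    estimates for the overlap domain, and that difference divided by its
    estimated standard error under independent sampling."""
    d_ab = y_ab_a - y_ab_b
    return d_ab, d_ab / math.sqrt(var_a + var_b)


# Hypothetical overlap-domain estimates and their variance estimates.
d_ab, z = mode_effect_z(5200.0, 90000.0, 4700.0, 160000.0)
```

A large absolute value of `z` flags a discrepancy between the two samples, but, as noted above, it cannot by itself attribute the discrepancy to mode rather than to nonresponse or misclassification.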


6. Domain misclassification and bias adjustment 


The estimators discussed in Section 2 construct weights 
for the observations based on domain membership. Thus in
the estimator $\hat{Y}(\theta)$ in (3), the weight multiplier of an
observation from sampling frame A is 1 if the observation is
in domain a, and is $\theta$ if the observation is in domain ab,
in order to account for the multiplicity of sampling.

In practice, domain membership may not be clear. For 
the situation in Figure 1, it may be unknown whether a 
respondent in an area frame also belongs to the list frame. If 
frame A is an area frame and frame B is an internet frame, 
for example, the only way to determine whether an 
individual sampled from frame A is also in frame B may be 
to ask the person about internet access, and the person might 
not give the correct response. 

If matching or record linkage is used to determine frame 
membership, imperfect matching can also misclassify observations.
Lesser and Kalsbeek (1999) discussed nonsampling
errors that occur in dual frame surveys that have been
conducted by the U.S. National Agricultural Statistics
Service. Domain misclassification can occur if a farm
sampled in the area frame is incorrectly classified with 
respect to its list frame membership. In landline/cellular dual 
frame telephone surveys, it is challenging to determine 
whether a person in one frame is also in the other frame 
(Kennedy 2007). A person reached in a landline telephone 
sample may also have a cell phone, but rarely take calls on 
the cell phone. While technically in the overlap domain, that 
person is virtually unreachable in the cell phone survey. 
Some landline/cellular surveys ask respondents about the 
relative amounts of cellular or landline telephone usage, but 
misclassification can occur. 

In practice, we expect domain misclassification to be 
related to responses of interest; we also expect that in many 
situations, misclassification is more likely to occur in certain 
directions. In longitudinal dual frame surveys, domain 
misclassification can have greater effects than in cross- 
sectional surveys (Lu and Lohr 2010). In some situations, 
the domain indicator can be missing or unavailable. Clark, 
Winglee and Liu (2007) investigated logistic regression and 
record-linkage methods for predicting the domain of an 
observation with missing domain information. 


6.1 Misclassification bias adjustments 


If domain misclassification is severe, each method for
modifying the survey weights to adjust for multiplicity can
result in biased estimates of population quantities. In this
section we derive a correction for the domain misclassification
bias of the fixed weight estimator of Section 2.2
when misclassification probabilities are known. Let the $D$-vector
$\delta_i^{A_q}$ denote the true domain membership for
observation $i$ of frame $A_q$, containing a 1 in position $d$ if
observation $i$ is in domain $d$, and 0 elsewhere. Let
$\mathbf{Y} = (Y_1, \ldots, Y_D)'$ denote the vector of population totals for
the $D$ domains. For an overlapping dual frame survey,
$\mathbf{Y} = (Y_a, Y_{ab}, Y_b)'$; for a three-frame survey, $\mathbf{Y}$ contains the
seven domain totals $Y_a$, $Y_b$, $Y_c$, $Y_{ab}$, $Y_{ac}$, $Y_{bc}$ and $Y_{abc}$.
If there is no domain misclassification,

$$\hat{\mathbf{Y}}^{A_q} = \sum_{i \in S(A_q)} \delta_i^{A_q} w_i^{A_q} y_i$$

is the corresponding estimator of $\mathbf{Y}$ from $S(A_q)$. For fixed
weight adjustment vector $m^{A_q} = (m_1^{A_q}, \ldots, m_D^{A_q})'$ for
frame $A_q$, satisfying $\sum_{q=1}^{Q} m_d^{A_q} = 1$ for each domain $d$, we then have
$E[\sum_{q=1}^{Q} (m^{A_q})' \hat{\mathbf{Y}}^{A_q}] = Y$.

Now suppose there is misclassification. Let $\eta_i^{A_q}$ denote
the observed classification for observation $i$ in $S(A_q)$. We can
write $\eta_i^{A_q} = (M_i^{A_q})' \delta_i^{A_q}$, where $M_i^{A_q}$ is a $D \times D$ matrix
containing a 1 in position $(d, e)$ if observation $i$ in true
domain $d$ is (mis)classified to domain $e$, and 0 elsewhere.

To allow for differential misclassification within domains,
we posit a structure in which the misclassification
probabilities can differ for subpopulations in a frame. In a
landline/cellular survey, for example, some age groups may
be known to have higher misclassification probabilities than
others. Chambers, Chipperfield, Davis and Kovačević
(2008) used a similar grouping approach to correct for
record linkage errors. Suppose the population can be divided
into $G$ groups, $g = 1, \ldots, G$, in which the misclassification
probabilities are known for each frame $A_q$. Let $\theta_g^{A_q}(d, e)$
denote the probability that an observation in group $g$ with
true domain $d$ is classified into domain $e$ in sample
$S(A_q)$, and let $\Theta_g^{A_q}$ be the $D \times D$ matrix with entries
$\theta_g^{A_q}(d, e)$. For observation $i$ belonging to group $g$ and true
domain $d$, assume that row $d$ of $M_i^{A_q}$ is generated as a
multinomial random variable of size 1 with probabilities in
row $d$ of the expected misclassification matrix $\Theta_g^{A_q}$, and
that all $M_i^{A_q}$ are independent of each other and of the
sample inclusion variables. We thus have $G$ matrices of
misclassification probabilities for frame $A_q$, $\Theta_1^{A_q}, \ldots, \Theta_G^{A_q}$.
Denote the vector of population totals for group $g$ by
$\mathbf{Y}(g) = \sum_{i=1}^{N} \delta_i^{A_q} x_i(g) y_i$, where $x_i(g) = 1$ if observation
$i$ is in group $g$ and 0 otherwise.

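With the observed domain classifications $\eta_i^{A_q}$, the
design-weighted estimator of the vector of domain totals in
group $g$ is

$$\hat{\mathbf{Y}}^{A_q}(\mathrm{mis}, g) = \sum_{i \in S(A_q)} \eta_i^{A_q} x_i(g) w_i^{A_q} y_i = \sum_{i \in S(A_q)} (M_i^{A_q})' \delta_i^{A_q} x_i(g) w_i^{A_q} y_i,$$

so that $E[\hat{\mathbf{Y}}^{A_q}(\mathrm{mis}, g)] = (\Theta_g^{A_q})' \mathbf{Y}(g)$.
Now consider a new vector of weight adjustments
$\tilde{m}_g^{A_q} = (\tilde{m}_{g,1}^{A_q}, \ldots, \tilde{m}_{g,D}^{A_q})'$ for group $g$ in frame $A_q$.
Then

$$E\Big[\sum_{q=1}^{Q} \sum_{g=1}^{G} (\tilde{m}_g^{A_q})' \hat{\mathbf{Y}}^{A_q}(\mathrm{mis}, g)\Big] = \sum_{q=1}^{Q} \sum_{g=1}^{G} (\Theta_g^{A_q} \tilde{m}_g^{A_q})' \mathbf{Y}(g).$$

Since $\sum_{q=1}^{Q} \sum_{g=1}^{G} (m^{A_q})' \mathbf{Y}(g) = Y$, the bias will be eliminated
under this model when

$$\tilde{m}_g^{A_q} = (\Theta_g^{A_q})^{+} m^{A_q}, \qquad (6)$$

where $(\Theta_g^{A_q})^{+}$ is the Moore-Penrose inverse of $\Theta_g^{A_q}$,
obtained by taking the inverse of the nonzero rows and
columns of $\Theta_g^{A_q}$.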
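The correction in (6) can be checked numerically. The sketch below is an illustration only: it works with exact fractions, orders the dual frame domains as (a, ab, b), and inverts the block of the misclassification matrix formed by its nonzero rows and columns (assumed to be invertible). The function names are hypothetical. The two example matrices are misclassification patterns (b) and (c) for frame A from the simulation study of Section 6.2, applied to the fixed weights for $\theta = 1/2$.

```python
from fractions import Fraction as F


def solve(matrix, rhs):
    """Solve matrix * x = rhs exactly by Gauss-Jordan elimination."""
    n = len(rhs)
    aug = [list(row) + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        lead = aug[col][col]
        aug[col] = [v / lead for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [row[n] for row in aug]


def bias_corrected_weights(theta, m):
    """Solve theta * m_tilde = m on the domains that belong to the frame
    (rows of theta that are not all zero); other entries are set to 0."""
    keep = [d for d, row in enumerate(theta) if any(v != 0 for v in row)]
    sub = [[F(theta[d][e]) for e in keep] for d in keep]
    solution = solve(sub, [F(m[d]) for d in keep])
    corrected = [F(0)] * len(m)
    for d, value in zip(keep, solution):
        corrected[d] = value
    return corrected


# Domain order (a, ab, b); domain b is not part of frame A, so its row is 0.
m_a = [F(1), F(1, 2), F(0)]  # fixed weights for frame A with theta = 1/2
theta_b = [[F(9, 10), F(1, 10), 0], [0, 1, 0], [0, 0, 0]]                # pattern (b)
theta_c = [[F(9, 10), F(1, 10), 0], [F(1, 10), F(9, 10), 0], [0, 0, 0]]  # pattern (c)

corrected_b = bias_corrected_weights(theta_b, m_a)
corrected_c = bias_corrected_weights(theta_c, m_a)
```

The results, 19/18 and 1/2 for pattern (b) and 17/16 and 7/16 for pattern (c), agree with the bias-corrected weight adjustments quoted in Section 6.2.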

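Replacing weight adjustments $m^{A_q}$ by $\tilde{m}_g^{A_q}$ eliminates
the bias under the multinomial misclassification model but
inflates the variance. For frame $A_q$,

$$\begin{aligned}
V\Big[\sum_{g=1}^{G} (\tilde{m}_g^{A_q})' \hat{\mathbf{Y}}^{A_q}(\mathrm{mis}, g)\Big]
&= E\Big[V\Big(\sum_{g=1}^{G} \sum_{i \in S(A_q)} \{(\Theta_g^{A_q})^{+} m^{A_q}\}' (M_i^{A_q})' \delta_i^{A_q} x_i(g) w_i^{A_q} y_i \,\Big|\, S(A_1), \ldots, S(A_Q)\Big)\Big] \\
&\quad + V\Big[E\Big(\sum_{g=1}^{G} \sum_{i \in S(A_q)} \{(\Theta_g^{A_q})^{+} m^{A_q}\}' (M_i^{A_q})' \delta_i^{A_q} x_i(g) w_i^{A_q} y_i \,\Big|\, S(A_1), \ldots, S(A_Q)\Big)\Big] \\
&= \sum_{g=1}^{G} \{(\Theta_g^{A_q})^{+} m^{A_q}\}' \, E\Big[\sum_{i \in S(A_q)} x_i(g) (w_i^{A_q} y_i)^2 \big\{\mathrm{diag}[(\Theta_g^{A_q})' \delta_i^{A_q}] - (\Theta_g^{A_q})' \delta_i^{A_q} (\delta_i^{A_q})' \Theta_g^{A_q}\big\}\Big] \{(\Theta_g^{A_q})^{+} m^{A_q}\} \\
&\quad + V\Big[\sum_{i \in S(A_q)} (m^{A_q})' \delta_i^{A_q} w_i^{A_q} y_i\Big].
\end{aligned}$$

The second term is the variance of the contribution from
frame $A_q$ when the units are classified correctly. The first
term is zero only when $\Theta_g^{A_q}$ is diagonal for all $g$, i.e., there
is no misclassification.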

The weight adjustments in (6) may be extended to the
case in which the original fixed weights $m^{A_q}$ vary for the
groups, as long as $\sum_{q=1}^{Q} m_d^{A_q} = 1$ for each domain. Note
that the bias correction method in this section is proposed
only for the fixed weight estimators, and not for the PML, 
PEL, or optimal estimators where the multiplicity weights 
depend on the data. The bias correction depends on the 
correct specification of the misclassification probabilities. If 
the misclassification probabilities are estimated from 
another survey, the operational methods of the surveys must 
be similar. 


6.2 Simulation study 


Lohr and Rao (2006) found in simulation studies that the 
PML estimator has smaller mean squared error than the 
other estimators when random misclassification is present, 
but this is due largely to the smaller variance of that 
estimator. To study sensitivity of estimators to other forms 
of domain misclassification, we performed a simulation 
study for two- and three-frame surveys. The population for
domain $d$ was generated using the model
$y_{dij} = \mu_d + \alpha_{di} + \varepsilon_{dij}$, for $i = 1, \ldots, N_d/5$ and $j = 1, \ldots, 5$, with
$\alpha_{di} \sim N(0,1)$ and $\varepsilon_{dij} \sim N(0,1)$ generated independently, and
then probability samples were drawn from this population.

For the two-frame study, the domain means are $\mu_a = -1$,
$\mu_{ab} = 0$, $\mu_b = 2$ and factors in the simulation are:


1. Sample size: 100 or 200 from each frame.


2. Cluster sample or simple random sample drawn
from frame A. A cluster sample was drawn by
selecting a simple random sample of $n_A/5$ of the
groups used in generating the population.

3. Misclassification probabilities for frame A (all
probabilities not listed are 0):

a. $\theta^A_{a,a} = 1$, $\theta^A_{ab,ab} = 1$ (no misclassification);

b. $\theta^A_{a,a} = 0.9$, $\theta^A_{a,ab} = 0.1$, $\theta^A_{ab,ab} = 1$;
c. $\theta^A_{a,a} = 0.9$, $\theta^A_{a,ab} = 0.1$, $\theta^A_{ab,ab} = 0.9$, $\theta^A_{ab,a} = 0.1$;

d. $\theta^A_{a,a} = 1$, $\theta^A_{ab,ab} = 0.9$, $\theta^A_{ab,a} = 0.1$.
4. Misclassification probabilities for frame B:
a. $\theta^B_{b,b} = 1$, $\theta^B_{ab,ab} = 1$ (no misclassification);
b. $\theta^B_{b,b} = 0.8$, $\theta^B_{b,ab} = 0.2$, $\theta^B_{ab,ab} = 1$;
c. $\theta^B_{b,b} = 0.8$, $\theta^B_{b,ab} = 0.2$, $\theta^B_{ab,ab} = 0.9$, $\theta^B_{ab,b} = 0.1$;
d. $\theta^B_{b,b} = 1$, $\theta^B_{ab,ab} = 0.8$, $\theta^B_{ab,b} = 0.2$.
5. Population sizes: N,=N,=N,, =25,000; N,= 
N,, = 10,000, N,,=55,000; N,=25,000, N,,= 
40,000, N, = 10,000. 


Ten thousand replicates were run for each combination
of the factors, giving the Monte Carlo estimate of bias a
standard error of approximately 100. We studied all
estimators in Section 2, including $\hat{Y}(1/2)$, $\hat{Y}(2/3)$, and
$\hat{Y}(1)$ from (3). We also examined poststratified estimators
that could be employed when the domain population counts
$N_a$, $N_{ab}$, and $N_b$ are known: estimators with subscript
“post1” apply poststratification to the two samples first and
then combine the samples, and estimators with subscript
“post2” combine the samples first and then poststratify to
the domain population counts. The bias-corrected estimators
$\hat{Y}(1/2)_{bc}$ and $\hat{Y}(2/3)_{bc}$ modify the initial fixed weights
corresponding to $\theta = 1/2$ and $\theta = 2/3$ using (6). With
misclassification pattern (b) in frame A, for example, the
bias-corrected weight adjustments for $\hat{Y}(1/2)_{bc}$ are
$\tilde{m}^{A} = 19/18$ for $i$ classified in a and $\tilde{m}^{A} = 1/2$ for $i$
classified in ab; for pattern (c), the bias-corrected weight
adjustments are 17/16 and 7/16, respectively. The single
frame estimator is omitted from these tables since it is the
same as either $\hat{Y}(1/2)$ or $\hat{Y}(2/3)$; the single frame
estimator raked to the population totals $N_A$ and $N_B$ is
denoted by $\hat{Y}_{S,\mathrm{rake}}$. Tables 1 and 2 display results for
$n_A = 100$, $n_B = 100$, $N_a = N_{ab} = N_b = 25{,}000$, and a simple
random sample from frame A; Tables 3 and 4 give results
for $n_A = 200$, $n_B = 100$, $N_a = N_{ab} = N_b = 25{,}000$, and
a cluster sample from frame A. The general patterns of
results are similar for the other simulations and are not
shown here.
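Under the multinomial model of Section 6.1, the expected value of an uncorrected fixed weight estimator follows from $E[\hat{\mathbf{Y}}(\mathrm{mis})] = \Theta' \mathbf{Y}$, so the misclassification bias contributed by one frame is $(\Theta m - m)' \mathbf{Y}$. The sketch below evaluates this expression for $\hat{Y}(1/2)$ in the two-frame design with 25,000 units per domain and domain means $-1$, 0 and 2. It is an illustrative calculation, not part of the reported simulation; the function name is hypothetical and the domain order (a, ab, b) is an assumption of the example.

```python
def expected_bias(theta, m, domain_totals):
    """Expected misclassification bias contributed by one frame:
    the sum over domains d of [(theta * m)_d - m_d] * Y_d."""
    bias = 0.0
    for d, row in enumerate(theta):
        if not any(row):
            continue  # domain d is not part of this frame
        applied = sum(p * m_e for p, m_e in zip(row, m))
        bias += (applied - m[d]) * domain_totals[d]
    return bias


# Domain order (a, ab, b): totals are domain size times domain mean.
totals = [25000 * -1.0, 25000 * 0.0, 25000 * 2.0]

m_a = [1.0, 0.5, 0.0]  # frame A multipliers for theta = 1/2
m_b = [0.0, 0.5, 1.0]  # frame B multipliers for theta = 1/2

# Pattern (b) in frame A: 10% of domain a units are classified into ab.
theta_a = [[0.9, 0.1, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.0]]
# Pattern (b) in frame B: 20% of domain b units are classified into ab.
theta_b = [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.2, 0.8]]

bias_a = expected_bias(theta_a, m_a, totals)
bias_b = expected_bias(theta_b, m_b, totals)
```

The frame A value of about 1,250 is in line with the simulated bias of 1,163 reported in Table 1 for pattern (b) in frame A with frame B correctly classified, given a Monte Carlo standard error of roughly 100.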

First, consider the fixed weight estimators. The bias- 
corrected estimators reduce the bias as expected; in all cases 
studied with misclassification, the empirical bias from the 
bias-corrected estimators was less than 200 in absolute 
value, which is within the margin of error. Although the
standard deviation for the bias-corrected estimators is higher
than for the uncorrected estimators, in most cases the mean
squared errors are comparable.

The screening estimator $\hat{Y}(1)$, which discards units from
frame B in domain ab, exhibits no misclassification bias 
when frame B is correctly classified. It also exhibits no bias 
in Tables 1 and 3 with frame-B misclassification pattern (d) 
because the observations misclassified from domain ab to 
domain b have mean 0; for different sets of domain means, 
pattern (d) does create bias. For the other cases, the 
screening estimator has the highest bias. For every misclas- 
sification pattern, the screening estimator has high mean 
squared error because data are thrown away. If the domain 
means are similar, then the misclassification might not result 
in appreciable bias but discarding observations from domain 
ab in S(B) would greatly increase the mean squared error. 

Poststratifying to the domain totals when there is mis- 
classification often increases the bias instead of decreasing 
it. Consider line 4 of Table 1, where 20% of the S(B)
observations in ab are mistakenly classified into domain
b. Each unit in S(B) has design weight 500 (a sample of 100
from the 50,000 units in frame B), and on average 60 rather
than 50 of the sampled units are classified into domain b.
Poststratifying to $N_b = 25{,}000$ therefore reduces the weights
of the observations that are really in domain b, with mean 2,
from 500 to approximately 25,000/60, or about 417, which
causes the poststratified versions of $\hat{Y}(1/2)$ to
be biased. The effect of poststratification on the mean
squared error is mixed, and depends on whether the variance
reduction achieved by poststratifying exceeds the additional
bias that can be introduced. Raking to the frame totals $N_A$
and $N_B$ in $\hat{Y}_{S,\mathrm{rake}}$ has a similar effect on misclassification
bias as poststratification.

For the simple random samples in Tables 1 and 2, the
PML and PEL estimators often exhibit much more bias than
the uncorrected fixed weight estimators. The relative
contributions from the two frames for these methods depend
on the estimated variances of $\hat{N}_{ab}^{A}$ and $\hat{N}_{ab}^{B}$, the domain
weights depend on $\hat{N}_{ab}^{PML}$, and these two factors interact in
complex ways depending on the misclassification structure.
For misclassification pattern (d) in either frame, $\hat{N}_{ab}^{PML}$ is
too small because observations in domain ab are misclassified;
consequently, the weights for the observations in
the nonoverlapping domains are too large. A poststratified
version of the PML estimator shared the bias problems of
the fixed weight poststratified estimators. The PEL estimator,
by forcing the estimators of $Y_{ab}$ to be equal, can
worsen the bias. For example, in the simulation in line 3 of
Table 1, with correct classification for frame A and pattern
(c) for frame B, the PEL bias is 50% larger than the PML
bias. In this case, the PEL estimator pulls the unbiased
estimator $\hat{Y}_{ab}^{A}$ from S(A) toward the biased estimator from
frame B. The optimal estimators also exhibit high bias.


Lohr: Alternative survey sample designs: Sampling with multiple overlapping frames


Table 1


Estimated bias for dual frame misclassification, with $n_A = n_B = 100$ and a simple random sample taken from each frame. MPA and
MPB refer to the misclassification patterns for frames A and B


MPA MPB | Y(1/2) Y/2)postr YC/2post2 YU/2ge ¥(2/3) ¥(2/3) YO Vee rye yon pre cee 
a a -194 287 -87 -194 215 -215 1958 68 LO. MALE xh9 -163 
a be | 251015 4,145 4,529 5 76678 [7 (10002! 5,417" 1248" oaeen S42 2,361 
a ev itest42 18 -898 -128" 6,823 -138 -10,185 -5,413 -2,583 -1,650 -2,482 —-1,690 
a d 257 -8,430 -8,431 -47 -69 -55 -92 30-6576, 6,723 96,725. 6,195 
b a 1,163 -1,238 -1,290 -82 748 -82 82. 1.355. 2376. Neils ©0551 meg 0e 
b bl 3,724 3,040 3,264 Usiacmne 50784 65 9,905 -3,967  -920 30) e850 -100 
b G esegse 2,192 Si [24 5.077 136 #1<10,167. 3.954. 4-4319- 23. 8210e AAT 63.853 
b d 1,322 -9,445 -9,621 92 917 104 1OSe08600 4-8.219),7-8.9720) 17-8531 aler-8.899 
c a 1,366 1,315 1313 123 969 140 174) 1,530 y) 1:529 0 16325 11.355 1,276 
c ae lrect ps 5,456 5,948 51 —--5,801 64 9,945 -4,216 2,096 3,500 2,391 3,355 
c c | -3,797 235 512 Pisa 868 DAE 10011 24089) SUSTT ENG, e118 -466 
c d 1,285 -7,072 OAD 56 873 60 48 1,535 -4,665 -5,131 -4,976 —-5,222 
d a -120 2,134 2,134 bh t -132 -126 -155 JMB PE s536 1 3is88 3,470 
d b | -4,979 6,497 7,086 34. -6,620 65 9,901 -5,599 4,339 5,928 4,788 5,697 
d oo es82 1,174 1,644 -137  -6,835 52 = =10200 =5:622" 310” 1,626 578 1,540 
d fr} 90 -5,999 -5,998 107 98 119 iad? © 107°" 22.964 23-116 7-3,120. 3155


Table 2


Estimated $\sqrt{\mathrm{MSE}}$ for dual frame misclassification, with $n_A = n_B = 100$ and a simple random sample taken from each frame.
MPA and MPB refer to the misclassification patterns for frames A and B


MPA MPB | F(1/2)_ ¥(1/2)posti VL i2y o> E(Li2)eeakC/3)) YOM.


a a 9,646 TRH) 7,910 9,646 9,729 
a b 10,602 9,351 9,531 9,926 11,531 
a c 10,779 8,622 8,603 10,071 Si 
a d 9,789 11,719 11,704 9,674 9,884 
b 9,623 8,182 8,185 9,718 9,686 
b b 9,955 9,054 ONS 7 1905 10,949 
b c 10,146 9,014 9,014 10,160 MIS LOT7, 
b d 9,868 12,600 12,716 9,826 9,952 
c a 9,843 8,185 8,180 9,887 9,853 
c b 10,049 10,113 10,396 10,039 11,029 
c c 10,247 8,701 8,718 10,254 11233, 
C d 10,021 10,861 10,936 9,966 10,068 
d a O5195 8,127 8,121 9,734 9,845 
d b 10,718 10,601 10,970 10,001 11,602 
d c 10,847 8,558 8,650 10,099 11,769 
d d 9,945 10,070 10,057 9,778 10,019 


¥(1) Vy Yorn Yep, rake 
9,729 10,304 9,677 8,151 8,081 8,115 8,075 
10,197 14,181 PISS 8,212 8,377 8,198 8,311 
10,402 14,376 11,243 8,817 8,514 8,720 8,508 
9,795 10,432 9,819 10,979 10,978 11,003 11,007 
9,766 ~=—-:10,307 9,780 8,446 8,447 8,444 8,459 
10,212 14,069 10,489 8,074 7,913 7,995 7,898 
10,489 14,404 10,616 9,443 9,108 9,448 9,114 
9,927 10,567 10,023 12,063 12,284 12,188 12,371 
9,877 10,341 9,991 8,516 8,417 8,442 8,402 
LO229E 514 127, 10,662 8,520 8,863 8,529 8,778 
10,534 14,306 10,799 8,762 8,527 8,669 8,516 
10,016 10,579 10,177 10,113 10,211 10,168 10,240 
9,788 10,343 9,829 9,158 9,024 9,042 8,991 
10,258 14,149 11,358 9,461 10,157 9,595 9,986 
10,426 14,387 11,424 8,674 8,707 8,608 8,664 
9,885 10,510 9,986 9,458 9,412 9,449 9,417 


When a cluster sample is taken from frame A, as in 
Tables 3 and 4, the bias patterns are similar. When there is 
no misclassification, the MSEs of the optimal and PML 
estimators are smaller than that of $\hat{Y}(2/3)$ because they
account for the survey design. With misclassification, 
though, the MSE advantage is reduced because of the 
increased bias. 

To study misclassification with a three-frame survey, we 
selected simple random samples from each frame, and had 
correct classifications for frames B and C. Table 5 shows 
results for a simulation with three frames and a simple 
random sample of size 200 from each frame. The population 
was generated with N, =10,000 in each domain and 
domain means p1,= 1, Hy = 2, Hy = 35 ane = 45 Hy = 
5, H,, = 6, u, = 7. In this simulation, frames B and C are 
correctly classified, and the misclassification patterns for 
frame A are given in the table. We also studied other 
domain means, population domain sizes, and sample sizes 
using a factorial design; results for the other settings showed 
a similar pattern and are not shown here. The multiplicity 
estimator Y,.., with m, =1 for i € {a,b,c}, m, =1/2 
for i € {ab, ac, bc}, and m,=1/3 for ic abc, is 
optimal when there is no misclassification, and it equals the 
unraked single frame estimator. The other fixed weight 
estimators studied are Y,,,, with m2 = m3 = m\C° = 
if m*%) = m4 = m* abe) _ oe m2) = mo) = 
1/3, and m= mS) =1/6, and the screening 
estimator Y, with m 42 = m2 = mo = m4) = 


A ie B,be 
48) = pg A280) — py (Bb) = 


As with the two-frame study, the bias-corrected esti- 
mators are approximately unbiased. The screening estimator 
is also approximately unbiased since only S(A) is misclas- 
sified. The other estimators all exhibit substantial bias with 
at least some of the misclassification patterns. For the 
simulation settings in Table 5, the poststratified, single 
frame raking, Hartley, and PML estimators exhibit large 
bias but nevertheless have smaller mean squared error than 
the fixed weight and bias-corrected estimators; this MSE 
ordering does not hold in some of the other simulation 
settings. 

Mecatti (2007) and Rao and Wu (2010) argued that the
fixed weight multiplicity estimator $\hat{Y}_{ave}$ is unbiased if the
only misclassification is among domains that belong to the
same number of frames. Misclassifying observations from
domain ab to domain ac (pattern c) results in no bias
because the weight adjustment in both domains is 1/2.
One would expect pattern (c), however, with two
errors in domain membership (not reporting membership in
frame B and erroneously reporting membership in frame C),
to be less likely to occur in practice than misclassifying an
observation in ab as either a or abc; $\hat{Y}_{ave}$ can be very
sensitive to the latter forms of misclassification. Although a
fixed weight estimator is insensitive to misclassification 
among domains in which the weight adjustments are equal, 
in these simulations every fixed weight estimator exhibits 
significant bias for at least some misclassification patterns. 
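
The weight argument above can be illustrated with a minimal sketch. It is not code from the studies cited: the function name `multiplicity_weight` and the encoding of a domain as a string of frame letters are assumptions made only for this example. The adjustment is taken to be the reciprocal of the number of frames to which the reported domain belongs.

```python
from fractions import Fraction


def multiplicity_weight(domain: str) -> Fraction:
    """Weight adjustment 1 / (number of frames the domain belongs to).

    A domain is encoded as a string of frame letters, e.g. "ab" means the
    unit belongs to frames A and B (hypothetical encoding for this sketch).
    """
    frames = set(domain)
    if not frames:
        raise ValueError("domain must name at least one frame")
    return Fraction(1, len(frames))


true_domain = "ab"
for reported in ("ac", "a", "abc"):
    # Ratio of the weight actually applied to the weight that should apply.
    ratio = multiplicity_weight(reported) / multiplicity_weight(true_domain)
    print(f"ab reported as {reported}: weight multiplied by {ratio}")
```

Under these assumptions, reporting ab as ac leaves the adjustment unchanged, whereas reporting ab as a or as abc changes it, which is the source of the sensitivity described above.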

Tables 1 to 5 show that each estimator from Section 2
can exhibit severe bias from domain misclassification. We
recommend that the possible extent of domain misclassification
be studied during the survey pretesting phase, so
that this information can be used in the survey design. If 
misclassification probabilities are known accurately, then it 
may be possible to choose a fixed weight estimator that is 
insensitive to the presumed form of misclassification. When 
a misclassification-robust estimator cannot be found or 
when it is inefficient, the fixed weight estimators can be 
adjusted to reduce the bias. It should be noted that the
bias-corrected weights proposed in Section 6.1 are sensitive to
the input misclassification probabilities. They also do not
account for other nonsampling errors such as nonresponse;
applying the misclassification weight adjustments in Section
6.1 followed by the nonresponse weight adjustments
described in Brick et al. (2011) may result in final weights
that correct neither for misclassification nor for nonresponse.
If domain misclassification and nonresponse are
both present, weight adjustments are needed that deal with
both problems simultaneously.


Table 3

Estimated bias for dual frame misclassification, with $n_A = 200$, $n_B = 100$, a cluster sample taken from frame A and a simple
random sample taken from frame B. MPA and MPB refer to the misclassification patterns for frames A and B


MPA MPB_ P(1/2) Y(1/2)postr YC/2)pose2 71/2), YQ/3) YQ/3), YQ) ne Yoo Youll Yeuim « Vomenin 

a a -148 -142 -139 -148 -155 -155 -170 -312 63 -119 -172 -184 
a b -5,090 4,199 4,599 -72 -6,774 -84 -10,144 -4,976 1,210 361 Spam 220 1,025 
a c -5,069 -1,088 -851 -72 -6,759 -96 -10,139 -4,800 -1,994 ide 36) -3,216 
a d -39 -8,379 -8,383 =35 -63 -58 -111 yp Sie ESS) esto) -6,996 
b a 1,168 -1,221 -1,258 -79 768 -63 -32 1,395 -1,690 -1,663 -2,514 -3,170 
b b -3,716 2,979 3,236 60 -5,784 We -9,918  -2,815 -86 1,776 -346 -2,087 
b c -3,704 -2,108 -2,074 73 -5,771 92 -9,905 -2,561 -2,970 -1,410 -3,267 -5,814 
b d Se -9455 -9,610 CB) 926 123 144 1,609 -7,285 -7,317  -7,938 -9,498 
c a alte) 1,281 1,304 -66 IP? -58 -4) 1486 = 1,831 1,652 943 840 
c b -3,879 5,545 6,087 -118 -5,971 -126 -10,156 -2,972 3,532 4,597 2,405 1,683 
c c -3,811 318 636 -44 -5,893 -42 -10,058  -2,671 110 1,128 -784 -2,328 
c d 1,423 -6,858 -6,973 191 1,022 206 220 1,824 -4,328 -4,014 -4,516 -5,624 
d a -33 2,282 2,290 -28 -35 -32 -40 -148 3,627 Syllstsy Sia l(08) 3,728 
d b -4,974 6,514 7,123 46 -6,660 30 -10,033 -4,863 4,768 6,274 4,742 4,549 
d c -4,95] 1,412 1,883 80 -6,621 84 -9,961 -4,682 1,357 2,863 1,451 388 
d cam 42 -5,987 -5,991 53 40 52 37 -126 -2,899 -2,780 —_-2,791 -3,317 

Table 4

Estimated MSE for dual frame misclassification, with $n_A = 200$, $n_B = 100$, a cluster sample taken from frame A and a simple
random sample taken from frame B. MPA and MPB refer to the misclassification patterns for frames A and B


MPA MPB Y(1/2) Y(1/2)post ¥C/2pose2 Y(1/2)p- Y(2/3) Y¥(2/3)p-  ¥() Vg Von vou Yeatie « Yen, Yoreee 
a a | 10,916 8,912 8,899 10.916 11,092 11,092 —-:11,879 10,975 9,250 9,155 10,109 9,418 
a b | 11,786 10,186 10,324 11,157 12,743 ~—s«-'11,503 15,463 12,253 8,906 9,391:10,123 9,231 
a c | 11,983 9,575 9,537 11.409 12,922 «11,814 ~—-:15,600 12,395 9,575 9,279 10,391 10,039 
a d | 11,042 12,357 12,375 10.941 11,250 11,173 12,056 11,051 11,591 11,605 12,229 12,053 
b a | 10,698 9,133 9,154 10.872 10,921 11,049 11,875 10,823 9,255 9,151 10,195 9,766 
b b | 10,957 9,803 9,867 11.071. 12,033 11,413" 15,262 11/215 8,681 "8,748 "99,610 ~~ 9,182 
b c | 11,115 9,860 9,846 11.272. 12,172 11,675 15,361 11,306 9,721 9,252 10,558 10,970 
b d | 10,988 13,269 13,408 115046" "11222" “Wen” “aai43" 084 12,4847 191947" 131279" 1398 
c a | 10,995 9,090 9,073 11,1060 = 1,187 1254 12,0287 1112500 "9,309 9, 190' 97798) 9889 
¢ b | 11,104 10,779 11,015 11,090 12,162 11,380 15,348 11,430 9,450 9,724 9,754 9,144 
c es | i 155 9,425 9,400 11,189 12,234 11,600 15,424 11,389 9,219 9,064 9,868 9,658 
c d | 10,922 11,328 11,421 {0.8962 11,1210) 41,091. © 11,929. 14,017 10,759 “10,456. 111Si 9 Wipike 
d a | 11,011 9,080 9,045 10.920 11,181 11,103 11,913 11,041 9,873 9,579 10,375 10,135 
d b | 11,838 1357 11,669 11,164 12,723 +=—«11,453 15,299 12,337 10,258 10,848 11,009 10,403 
d c | 11,804 9,334 9,371 11,159 12,707 11,548 15,298 12,224 9,349 9,507 10,102 9,442 
d dik 11,879 10,839 10,854 10.989 11,355 11,199 12,059 11,195 10,440 10,302 10,916 _10,519 
Table 5

Estimated bias and $\sqrt{\text{MSE}}$ for misclassification in a 3-frame survey, with $n_A = n_B = n_C = 200$ and a simple random sample


taken from each frame. MPA refers to the misclassification patterns for frame A. Pattern (a) has no misclassification; (b) o 


A = 0.8, 


aa 


A les, = A = A = , 2 Ae eas z v “ u 
P.,ab = 0.1, Pa.abe = 0.1, Peb.ab = 1, Pacac = 1, ae or (c) Poa 1, ab ab pee Ochre = 0.1, alae =1, Oehe.ahe aly (d) a a 
1, 7b,ab = 0.9, ab,abc = 0.1, OG cae F 1, Orseuhe =; (e) ia = 1, i5,ab = 0.8, Da hea = 0.1, ab,abe = 0.1, Doane a 1, Discuss =1 


A A A 


MPA ive ave. postl Fave, post2 Yave, be Y3/3 Y4/3, be For Yy YpML Yor, rake 
a ie -8 Si 28 -8 3) 5 20 ) -26 -208 
b -938 -1,409 -1,478 57 -586 77 107 -2,039 -5,676 -5,624 
bias c -26 -485 -508 -26 6 6 6 -324 -825 -957 
d -231 -514 -557 104 108 108 85 -326 -1,321 -1,438 
e 704 287 247 34 697 27 -4 1,488 1,420 1,193 
a 9,003 4,419 4,410 9,003 10,013 10,013 13,108 7,990 7,281 75298 
se b 8,961 4,711 4,730 8,955 SES aye? o958) 13,092 8,085 9,107 9,074 
MSE Cc 9,119 4,432 4,422 9,119 10,140 10,140 13,238 8,112 7,396 7,422 
d 8,894 4,405 4,405 8,893 9.874 9,874 12,919 T2957 7,414 7,433 
e 9,088 4,438 4,424 9,059 10,071 10,046 13,180 8,254 7,621 7,581 


7. Design issues 


As discussed in Section 1, multiple frame designs can 
give better coverage and precision than a single frame 
survey with equivalent cost. The design problem is more 
complex than with a single frame survey, though, since a 
design that is optimal for frame A and frame B separately 
may not be optimal for the combined sample. Similarly, a 
design that is optimal when estimator Y(1/2) is used may 
not be optimal for Yk. 

Hartley (1962, 1974) derived optimal designs for the
estimator $\hat{Y}(\theta_H)$ when a simple random sample is taken in
each frame. The optimal sample sizes $n_A$ and $n_B$ depend on
the relative costs of sampling from the two frames, and on 
the means and variances of the response variable within the 
domains. Cochran (1977, pages 144-145) described the dual 
frame survey in Figure 1 in his chapter on stratified 
sampling. In this situation, $N_a$ and $N_{ab}$ may be known,
especially if frame B is a list frame. Domains a and ab are 
treated as strata; there is one sample from stratum a and 
two independent samples from stratum ab. The design 
problem may be approached as a stratified sample design. 

In general, the optimal design is a function of sampling 
variances and nonsampling errors in each frame, as well as 
of the estimator chosen. Biemer (1984) and Lepkowski and 
Groves (1986) discussed designs for the situation in Figure 
1 when a stratified multistage sample is taken from each 
frame, using the Hartley estimator $\hat{Y}(\theta_H)$. Lepkowski and
Groves (1986) considered interviewer variability and mode 
bias as well as sampling error when assessing the precision 
of various designs; frames with less mode bias are allotted 
higher sample sizes. Brick (2010) derived optimal alloca- 
tions in the presence of nonresponse, and found that con- 
sidering the nonresponse when allocating resources to the 
two frames can greatly increase efficiency in both screening 
and overlap dual frame surveys. 


One of the advantages of a multiple frame design is its 
flexibility; it is well suited for a modular approach to survey 
design. In some situations, it may be practical to take an 
initial sample from the general population (frame A in 
Figure 4). The design of the samples from frames B and C, 
corresponding to subpopulations of interest, can then be 
determined using information in the frame-A sample. For 
example, if the frame-A sample yields too few engineers, 
the sample size from an engineering society membership list 
frame can be correspondingly increased. 

Rao (2003) suggested using multiple frame surveys to 
improve the accuracy of small area estimates in subgroups 
of interest. In this application, supplemental surveys can be 
taken in frames with high concentrations of subgroups of 
interest. As research needs change, resources can be re- 
allocated among the supplemental surveys without changing 
the main survey design. A crime victimization survey that 
uses a national area frame may be supplemented by local 
victimization surveys; as victimization patterns change, the 
local surveys can have different sample sizes or be moved to 
other geographic regions. 

Most survey designs attempt to achieve efficiency for the 
important responses, but in some situations a design that is 
efficient for one response is inefficient for others. For a 
survey in which each of four responses of interest was 
highly correlated with one of the possible stratification 
variables (but not necessarily correlated with the other strati- 
fication variables), Skinner, Holmes and Holt (1994) used a 
multiple frame survey with four independent stratified 
samples drawn from a common sampling frame. Each sam- 
ple was stratified using the stratification variable that was 
correlated with one of the responses of interest, and so was 
highly efficient for that response. In estimation, information 
from all four samples was combined. 

Multiple frame surveys can also be used in conjunction 
with sequential or adaptive sampling methods to improve 
yield of a rare or hard-to-reach population such as recent
immigrants. For example, a stratified multistage sampling 
design might be employed for frame A, while an adaptive 
cluster sampling design (Thompson 2002) might be used for 
frame B. Domain estimates can be calculated separately for 
the two designs, and then combined using methods in 
Section 2. In this situation, frames A and B may completely 
overlap, so that domain misclassification will not be an 
issue. 


8. Conclusions 


In this paper, we have summarized some of the issues 
involved in using multiple frame methods for U.S. house- 
hold surveys. Multiple frame designs have great potential 
for improving efficiency of data collection in household 
surveys. They can improve coverage by combining in- 
complete frames, improve the accuracy of estimates for 
subgroups or rare populations, and increase the flexibility 
and responsiveness of federal data collection. Multiple
frame surveys can facilitate sampling hard-to-reach populations
such as recent immigrants or households with infants:
a general population survey can be combined with an
adaptive sample design or a list frame of births.

In many cases, multiple frame surveys can provide more 
accurate estimates of population quantities without in- 
creasing data collection costs, but the design and estimator 
must be chosen carefully to realize these savings. A multiple 
frame survey, like other surveys, may have nonresponse, 
mode effects, and measurement errors. In addition, unless 
all of the frames consist of the entire population, multiple 
frame survey estimators can be sensitive to domain 
misclassification. One correction for misclassification was 
given in Section 6, but more research is needed on these 
challenges. Effects of domain misclassification, non- 
response, and mode bias may be confounded. A designed 
experiment may help disentangle these effects. We are 
currently studying the relation among these three types of 
nonsampling errors. Each form of nonsampling error affects 
the accuracy of multiple frame estimators, and anticipated 
nonsampling errors need to be incorporated in an optimal 
design. 
Ten years of balanced sampling with the cube method: An appraisal


Yves Tillé


Abstract 


This paper presents a review and assessment of the use of balanced sampling by means of the cube method. After defining 
the notion of balanced sample and balanced sampling, a short history of the concept of balancing is presented. The theory of 
the cube method is briefly presented. Emphasis is placed on the practical problems posed by balanced sampling: the interest 
of the method with respect to other sampling methods and calibration, the field of application, the accuracy of balancing, the 
choice of auxiliary variables and ways to implement the method. 


Key Words: Sampling; Balancing; Horvitz-Thompson estimator. 


1. Introduction 


While the idea of balanced sampling has been around 
since the early days of survey statistic development, ap- 
plying the concept has been difficult because almost all the 
proposed methods have either been enumerative or rejective 
and required considerable computation time. The algorithm 
of the cube method was proposed in 1998 by Deville and 
Tillé, and a first implementation was written by three
students of the École Nationale de la Statistique et de l’Analyse
de l’Information of Rennes in France (see Bousabaa,
Lieber and Sirolli 1999). Finally, the method was published
in Tillé (2001) and Deville and Tillé (2004). Since this time, 
several implementations have been proposed and several 
survey managers have used the cube method to select 
samples, the most important applications being the New 
French Census and the French Master Sample. 

Our aim is to assess 10 years of development and use of 
balanced sampling in order to better ascertain when and 
how the cube method can be used to select samples of 
householders or establishments. After discussing the con- 
cept of balanced sample and balanced sampling in Section 
2, we give a list of particular cases in Section 3. In Section 
4, we briefly trace the history of this concept for both the 
model-based and design-based frameworks. Next, in 
Section 5, we provide a brief overview of the cube method, 
which is a class of algorithms that allows us to select 
randomly balanced samples with given inclusion proba- 
bilities (see Deville and Tillé 2004; Tillé 2001, 2006b). We 
try to present the main principles of this algorithm without 
giving a detailed description of the technicalities of the 
method. Section 6 is devoted to the principles of variance 
estimation in balanced sampling. Finally, in Section 7, we
discuss the interest of balanced sampling in practice and
compare balanced sampling with other sampling methods
and calibration. We also give a list of recent applications.
This section also deals with the accuracy of balancing, the


choice of auxiliary variables and ways to implement bal- 
anced sampling. The paper ends with an exhaustive bibli- 
ographical list of references on balanced sampling and their 
applications. 


2. Balanced sampling 


2.1 Definition of a balanced sample 


Consider a sample s of size n that is a subset of a finite 
population U of size N. A sample is said to be balanced if, 


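for a vector of auxiliary variables $\mathbf{x}_k = (x_{k1}, \ldots, x_{kj}, \ldots, x_{kp})'$,

$$\frac{1}{n} \sum_{k \in s} \mathbf{x}_k = \frac{1}{N} \sum_{k \in U} \mathbf{x}_k, \qquad (1)$$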


which means that the sample means of the x-variables match 
their population means. 

Brewer (1999) drew a distinction between a balanced 
selection of samples and a random selection of samples. 
However, a balanced sample may be selected randomly. If a
sample $S$ is selected randomly, then each unit $k$ of
the population has an inclusion probability $\pi_k$ of being
selected. In this case, a random sample must satisfy the
following balancing equations:

$$\sum_{k \in S} \frac{\mathbf{x}_k}{\pi_k} = \sum_{k \in U} \mathbf{x}_k. \qquad (2)$$


In other words, in a balanced sample, the totals of the x-variables
are estimated without error. Several authors like
Cumberland and Royall (1981) and Kott (1986) would call
a sample that satisfies Equation (2) a ‘$\pi$-balanced sample’,
as opposed to a ‘mean-balanced sample’ defined by
Equation (1). Nevertheless, in this paper, we will consider
that (1) is only a particular case of (2) that occurs when
$\pi_k = n/N$ or when the sample is not selected randomly.
We refer to both cases as a balanced sample.
2.2 Balanced sampling design


Let p(s) denote the sampling design, i.e., the probability 
that sample s is selected, such that p(s) = Pr(S = s), 
where S is the random sample and n(S) the size of the 
sample S. According to the definition of Deville and Tillé 
(2004), a sampling design p(-) is said to be balanced on 
auxiliary variables $x_1, \ldots, x_p$ if the Horvitz-Thompson estimator
satisfies Equation (2). In a balanced sampling design,
the inclusion probabilities are decided prior to sampling. A 
balanced sampling can be viewed as a kind of calibration 
that is directly integrated into the sampling design. The 
main problem is that the balancing equations (2) can rarely 
be exactly satisfied. We refer to this difficulty as the 
‘rounding problem’. 


Example 1. If $N = 4$, $n = 2$, $\pi_k = 1/2$ for all $k \in U$ and
$x_1 = 0$, $x_2 = 1$, $x_3 = 2$, $x_4 = 4$, then the balancing equations
given in (2) become

$$\frac{1}{n} \sum_{k \in s} x_k = \frac{1}{N} \sum_{k \in U} x_k,$$

which is equivalent to

$$\sum_{k \in s} x_k = \frac{n}{N} \sum_{k \in U} x_k. \qquad (3)$$

Since

$$\frac{n}{N} \sum_{k \in U} x_k = \frac{2}{4}(0 + 1 + 2 + 4) = 3.5,$$

and the left side of (3) is always an integer, an exactly
balanced sample does not exist.
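
The rounding problem of Example 1 can be checked by brute force. The following sketch simply enumerates every sample of size 2 and compares its total with the right-hand side of (3); the variable names are chosen for illustration only.

```python
from itertools import combinations

x = [0, 1, 2, 4]          # auxiliary values from Example 1
N, n = len(x), 2          # population and sample sizes
target = n / N * sum(x)   # right-hand side of (3)

# Total of x over every possible sample of size n (units indexed from 0).
sample_sums = {s: sum(x[k] for k in s) for s in combinations(range(N), n)}

balanced = [s for s, total in sample_sums.items() if total == target]
closest = min(abs(total - target) for total in sample_sums.values())

print("target total:", target)
print("exactly balanced samples:", balanced)
print("smallest achievable imbalance:", closest)
```

No sample reaches the target total, so the best that can be done is a sample whose total differs from it by one half.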

Indeed, sample selection is an integer problem. The cube 
method therefore aims to select a sample that exactly satisfies
the inclusion probabilities $\pi_k$ while remaining as balanced
as possible.


3. Special cases of balanced sampling 


3.1 Unequal probability sampling and stratification 


Some well-known sampling designs are particular cases 
of balanced sampling: 

1. Sampling with a fixed sample size is a particular case
of balanced sampling. In this case, the only balancing
variable is $\pi_k$. The balancing equations given in (2)
become

$$\sum_{k \in S} \frac{\pi_k}{\pi_k} = \sum_{k \in S} 1 = \sum_{k \in U} \pi_k,$$

which means that the sample size must be fixed.

2. Stratification is a particular case of balanced sampling.
Suppose that the population is partitioned into
$H$ strata $U_h$, $h = 1, \ldots, H$, of sizes $N_h$, $h = 1, \ldots, H$,
and that a sample is selected in each stratum by
means of simple random sampling without replacement
with fixed sample size $n_h$, $h = 1, \ldots, H$. In this
case, the balancing variables are the indicator variables
of the strata

$$\delta_{kh} = \begin{cases} 1 & \text{if } k \in U_h, \\ 0 & \text{otherwise.} \end{cases}$$

Under a stratified design, the Horvitz-Thompson
estimators of the sizes of the strata exactly equal the
sizes of the strata, which is a property of balancing on
the indicator variables of the strata. Indeed, since the
inclusion probabilities in stratum $h$ are $\pi_k = n_h/N_h$,
$k \in U_h$, the balancing equations become

$$\sum_{k \in S} \frac{N_h \delta_{kh}}{n_h} = \sum_{k \in U} \delta_{kh}$$

and are exactly satisfied.
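
As a small numerical check of the second case, the sketch below draws stratified simple random samples from a made-up population (two strata of 10 and 30 units; all names and sizes are assumptions for illustration) and computes the Horvitz-Thompson estimator of each stratum size.

```python
import random

# Hypothetical population: two strata with N_h = 10 and N_h = 30 units.
strata = {"h1": list(range(0, 10)), "h2": list(range(10, 40))}
n_h = {"h1": 4, "h2": 6}  # fixed sample size in each stratum

rng = random.Random(0)
for _ in range(3):
    estimates = {}
    for h, units in strata.items():
        sample = rng.sample(units, n_h[h])   # SRS without replacement
        weight = len(units) / n_h[h]         # 1 / pi_k = N_h / n_h
        # Horvitz-Thompson estimator of the stratum size: sum of 1 / pi_k.
        estimates[h] = sum(weight for _ in sample)
    print(estimates)
```

Whatever units are drawn, the estimated stratum sizes equal 10 and 30, because the sample size within each stratum is fixed.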


These two designs are well known and are commonly 
applied in official statistics in order to reduce variance. The 
more general concept of balancing allows more freedom to 
choose the most appropriate balancing variables that will 
improve the accuracy of the estimators. 


3.2 Overlapping strata 


Constructing a stratified sampling design is often a diffi- 
cult exercise. Statisticians often try to stratify using several 
qualitative variables. However, in most cases, crossing all of 
the strata of all the variables will cause the cells to become 
too small for a sample to be selected in each cell. In the 
context of calibration, statisticians generally calibrate on 
marginal totals and not on all the cells contained in a 
contingency table. Since a balanced sampling can be viewed 
as a kind of calibration that is directly integrated in the 
sampling design, one would also like to balance using only 
marginal totals. Nevertheless, the usual theory of strati- 
fication does not allow overlapping strata since the strati- 
fication must be a partition of the population. Now, the cube 
method enables us to directly balance on totals of over- 
lapping strata by simply using the indicators of the strata as 
balancing variables. 


3.3 Balancing on a constant 


Another interesting special case of balanced sampling 
occurs when a constant is used as a balancing variable. If 
$x_k = 1$ for all $k \in U$, the balancing equations become

$$\sum_{k \in S} \frac{1}{\pi_k} = \sum_{k \in U} 1 = N.$$

Actually,

$$\sum_{k \in S} \frac{1}{\pi_k}$$

is the Horvitz-Thompson estimator of $N$. This means that,
if a constant is used as a balancing variable, the estimated 
population size matches the known size N, which is far 
from being a given when the statistical units are selected 
with unequal inclusion probabilities. 
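
To see why this is not automatic, consider a design that is not balanced on a constant. The sketch below assumes Poisson sampling (independent selections) with three units and made-up unequal inclusion probabilities; neither the design nor the values come from the text. It enumerates all samples exactly and lists the values taken by the Horvitz-Thompson estimator of $N$.

```python
from fractions import Fraction
from itertools import product

# Illustrative unequal inclusion probabilities (assumed values).
pi = [Fraction(1, 4), Fraction(1, 2), Fraction(3, 4)]
N = len(pi)

expected = Fraction(0)
values = set()
for indicators in product([0, 1], repeat=N):
    # Probability of this sample under Poisson sampling.
    prob = Fraction(1)
    for s_k, pi_k in zip(indicators, pi):
        prob *= pi_k if s_k else 1 - pi_k
    # Horvitz-Thompson estimator of N for this sample.
    n_hat = sum(Fraction(s_k) / pi_k for s_k, pi_k in zip(indicators, pi))
    expected += prob * n_hat
    values.add(n_hat)

print("E[N_hat] =", expected)
print("possible values of N_hat:", sorted(values))
```

The estimator is unbiased for $N$, yet no individual sample reproduces $N$ exactly; balancing on a constant removes this variability.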


4. History of the concept of balancing 
and existing methods 


The idea of balanced sampling is very old and is linked 
to the vague concept of representativeness that was already 
used by Kiaer (1896, 1899, 1903, 1905). The first paper 
dedicated to the selection of a balanced sample is due to 
Gini (1928) and Gini and Galvani (1929) who selected a 
sample of 29 from the 214 Italian districts in order to match 
several population totals. Both Neyman (1952) and Yates 
(1960) condemned the paper of Gini and Galvani essentially 
because this sample was not randomly selected (see Langel 
and Tillé 2010). The first methods for selecting a random 
balanced sample were proposed by Yates (1946) and Thionet 
(1953), but these methods were rejective in the sense that 
they involved selecting samples or changing units randomly 
in the sample until a balanced enough sample was obtained. 

In the model-based framework, Royall (1976a, b) advo- 
cated the use of balanced sampling in order to reach the 
optimal strategy and to protect against mis-specification of
the model (see also Royall and Pfeffermann 1982; Kott
1986; Cumberland and Royall 1988; Royall 1988; Tirari
2006; Nedyalkova and Tillé 2009). While several methods
for selecting a balanced sample are presented in the book of 
Valliant, Dorfman and Royall (2000), these methods do not 
necessarily specify the inclusion probabilities of the sample. 
In the model-based framework, it is important to have a 
balanced sample. However, this sample does not always 
need to be randomly selected. 

Hájek (1981) also advocated the use of balanced sampling.
For Hájek, balanced sampling is a particular case of a
representative strategy, a strategy being a pair made up of a
sampling design and an estimator. A representative strategy
is a strategy that estimates the totals of auxiliary variables
without error. In this sense, a balanced sampling design with
the Horvitz-Thompson estimator is a representative strategy.
Hájek (1981) proposes a rejective procedure that consists of
selecting a sequence of samples until a balanced one is 
obtained. Rejective procedures have two drawbacks: if 
several balancing variables are used, the procedure can be 
very slow; secondly, the inclusion probabilities of rejective 
designs are not the same as the original design. The inclu- 
sion probabilities of statistical units that are close to the 
population means are increased to the detriment of the units
that are far from the center (see for instance the simulations
of Legg and Yu 2010). 
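
The second drawback can be reproduced exactly on a toy population. In the sketch below (the auxiliary values 1 to 7 and the exact-balance acceptance rule are assumptions for illustration), simple random samples of size 3 are rejected unless their mean equals the population mean, and the resulting inclusion probabilities are computed by enumeration.

```python
from fractions import Fraction
from itertools import combinations

x = [1, 2, 3, 4, 5, 6, 7]           # illustrative auxiliary values (assumed)
N, n = len(x), 3
target = Fraction(n * sum(x), N)    # sample total matching the population mean

all_samples = list(combinations(range(N), n))
accepted = [s for s in all_samples if sum(x[k] for k in s) == target]


def inclusion_probabilities(samples):
    """Inclusion probabilities when every listed sample is equally likely."""
    return [Fraction(sum(k in s for s in samples), len(samples)) for k in range(N)]


original = inclusion_probabilities(all_samples)   # simple random sampling
rejective = inclusion_probabilities(accepted)     # after rejection

for k in range(N):
    print(f"x = {x[k]}: original {original[k]}, rejective {rejective[k]}")
```

In this example the unit located at the population mean ends up with inclusion probability 3/5, whereas every other unit drops from 3/7 to 2/5.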

Another method of selection consists of enumerating all 
the possible samples, and then constructing a sampling 
design only to select the samples that are adequately bal- 
anced. Such a design can be constructed by using linear 
programming. This technique was applied by Ardilly (1991) 
to select the primary units of the French master sample. 
Nevertheless, this method can only be applied on small pop- 
ulation sizes because of the combinatory explosion of the 
number of samples when the size of the population is large. 

Deville, Grosbras and Roth (1988) and Deville (1992)
proposed multivariate methods for balanced sampling with 
equal inclusion probabilities. Hedayat and Majumdar (1995) 
have proposed the adaptation of an experimental design 
technique that would enable a balanced sampling design to 
be constructed. Again, this technique is restricted to equal 
inclusion probabilities. Finally, the cube method was pro- 
posed by Deville and Tillé (2004). This method is general in 
the sense that the inclusion probabilities are exactly satis- 
fied, that these probabilities may be equal or unequal and 
that the sample is as balanced as possible. 

Fuller (2009) studied a rejective procedure by fixing a tol- 
erance interval outside of which the sample is rejected and 
proposed an estimator of variance. Even if the inclusion 
probabilities are changed with a rejective procedure, 
Fuller (2009) shows that efficient estimates are obtained 
by using the inclusion probabilities of the original design. 
Using a set of simulations, Legg and Yu (2010) com- 
pared this rejective procedure to the cube method and 
showed that both methods perform equally. Finally, 
Dudoignon and Vanheuverzwyn (2006) proposed a fast 
method of balanced sampling for marginal totals, whereas 
Périé (2008) proposed a method based on permanent 
random numbers that provides a balanced sample. With 
the Périé (2008) method, the inclusion probabilities are 
only approximately satisfied. 


5. The cube method 


5.1 Main ideas 


The cube method (see Deville and Tillé 2004; Tillé 2001, 
2006a, b; Ardilly 2006) is a class of sampling algorithms 
that selects a balanced sample and exactly satisfies a set of 
given inclusion probabilities. The cube method is an 
extension of the splitting method that was developed by 
Deville and Tillé (1998). It is based on a random trans- 
formation of the vector of inclusion probabilities until a 
sample is obtained such that: 

(i) _ the inclusion probabilities are exactly satisfied, 

(ii) the balancing equations are satisfied to the furthest 

extent possible. 
The name of the method comes from the geometric representation
of a sampling design. Indeed, a sample may be
represented by a vector of sample indicators:

$$\mathbf{s} = (I[1 \in s] \; \ldots \; I[k \in s] \; \ldots \; I[N \in s])',$$

where $I[k \in s]$ takes value 1 if $k \in s$ and 0 if not. A
sample may thus be viewed as a vertex of an N-cube as
shown in Figure 1.


Figure 1 Possible samples in a population of size N = 3


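Let us also define

$$E(\mathbf{s}) = \sum_{s \in \mathcal{S}} p(s)\, \mathbf{s} = \boldsymbol{\pi},$$

where $\boldsymbol{\pi} = [\pi_k]$ is the vector of inclusion probabilities. The
balancing equations

$$\sum_{k \in S} \frac{\mathbf{x}_k}{\pi_k} = \sum_{k \in U} \mathbf{x}_k$$

may also be written

$$\sum_{k \in U} \check{\mathbf{x}}_k s_k = \sum_{k \in U} \check{\mathbf{x}}_k \pi_k, \qquad (4)$$

where $s_k \in \{0, 1\}$ and $\check{\mathbf{x}}_k = \mathbf{x}_k/\pi_k$, $k \in U$. Expression (4) is
a system of equations with unknown values $s_k$ that defines
an affine subspace in $\mathbb{R}^N$ of dimension $N - p$ denoted by
$Q$, where

$$Q = \Big\{ \mathbf{s} \in \mathbb{R}^N \,\Big|\, \sum_{k \in U} \check{\mathbf{x}}_k s_k = \sum_{k \in U} \check{\mathbf{x}}_k \pi_k \Big\}.$$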
keU keU 

The problem of selecting a balanced sample may thus be 
reformulated. A balanced sampling design consists of 
choosing a vertex of the N-cube (a sample) that remains on 
the linear sub-space Q. Figures 2 and 3 respectively show 
two examples: the first one is a constraint of fixed sample 
size and the second one is a constraint that generates a 
rounding problem. 
Figure 2 Possible samples in a population of size N = 3 with a
constraint of fixed sample size n = 2

The cube method (Deville and Tillé 2004) is divided
into two phases: the flight phase and the landing phase. The 
flight phase is a random walk that begins at the vector of 
inclusion probabilities and remains in the intersection of the 
cube and the constraint subspace. This random walk stops at 
a vertex of the intersection of the cube and the constraint 
subspace. At the end of the flight phase, if a sample is not
obtained, the landing phase consists of selecting a sample that
is as close as possible to the constraint subspace.


Figure 3 Possible samples in a population of size N = 3 with a
constraint and a rounding problem


Example 2. If the constraint is the fixed sample size, the
flight phase randomly transforms a vector of inclusion
probabilities into a vector of 0 and 1. At each step of the
algorithm, the vector of inclusion probabilities is transformed
randomly, but the sum of inclusion probabilities
must remain equal to $n$. For instance, with $\boldsymbol{\pi} = (0.5, 0.5, 0.5, 0.5)'$
and $n = 2$, we are able to obtain the following
sequence of vectors:

$$\boldsymbol{\pi} = \begin{pmatrix} 0.5 \\ 0.5 \\ 0.5 \\ 0.5 \end{pmatrix} \rightarrow \begin{pmatrix} 0.6666 \\ 0.6666 \\ 0.6666 \\ 0 \end{pmatrix} \rightarrow \begin{pmatrix} 1 \\ 0.5 \\ 0.5 \\ 0 \end{pmatrix} \rightarrow \begin{pmatrix} 1 \\ 0 \\ 1 \\ 0 \end{pmatrix}.$$

The algorithm ends when all the components of the vector
are equal to 0 or 1.


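Example 3. If the constraint is the fixed sample size, a
rounding problem appears if the sum of inclusion probabilities
is not an integer. If there is a rounding problem, then
some components cannot be set to zero. For instance, with
$\boldsymbol{\pi} = (0.5, 0.5, 0.5, 0.5, 0.5)'$ and

$$\sum_{k \in U} \pi_k = 2.5,$$

we may observe the following sequence of vectors:

$$\boldsymbol{\pi} = \begin{pmatrix} 0.5 \\ 0.5 \\ 0.5 \\ 0.5 \\ 0.5 \end{pmatrix} \rightarrow \begin{pmatrix} 0.625 \\ 0 \\ 0.625 \\ 0.625 \\ 0.625 \end{pmatrix} \rightarrow \begin{pmatrix} 0.5 \\ 0 \\ 0.5 \\ 1 \\ 0.5 \end{pmatrix} \rightarrow \begin{pmatrix} 1 \\ 0 \\ 0.25 \\ 1 \\ 0.25 \end{pmatrix} \rightarrow \begin{pmatrix} 1 \\ 0 \\ 0.5 \\ 1 \\ 0 \end{pmatrix} = \boldsymbol{\pi}^*.$$

In this case, the flight phase cannot end with a vector of 0
or 1, because the sum must remain equal to 2.5. Instead, the
flight phase ends with a vector containing one non-integer
component.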


5.2 The flight phase 


The first step of the flight phase is presented in Figure 4 
for a very specific case: the population size N = 3. The 
only balancing constraint is the fixed sample size n = 2. At 
the first step, a vector $\mathbf{u}(0)$ must be chosen. This vector
may be chosen freely but must be such that $\boldsymbol{\pi} + \mathbf{u}(0)$
remains in the subspace of constraints. Actually, the cube
method is a family of methods that depends on the way the
vector $\mathbf{u}(0)$ is chosen. This vector may be chosen randomly
or not.

If, from $\boldsymbol{\pi}$, we follow the direction given by vector
$\mathbf{u}(0)$, then we will necessarily cross a face of the cube. Let
us consider this point, denoted in Figure 4 by
$\boldsymbol{\pi}(0) + \lambda_1^*(0)\mathbf{u}(0)$. Now, if, from $\boldsymbol{\pi}$, we follow the opposite
direction, i.e., the direction given by vector $-\mathbf{u}(0)$, we will
also cross a face of the cube. Let us consider this point,
denoted in Figure 4 by $\boldsymbol{\pi}(0) - \lambda_2^*(0)\mathbf{u}(0)$. At the first step,
vector $\boldsymbol{\pi}(0) = \boldsymbol{\pi}$ is modified randomly. Vector $\boldsymbol{\pi}(1)$ will
be set to $\boldsymbol{\pi}(0) + \lambda_1^*(0)\mathbf{u}(0)$ or to $\boldsymbol{\pi}(0) - \lambda_2^*(0)\mathbf{u}(0)$. The
choice is made randomly in such a way that $E[\boldsymbol{\pi}(1)] = \boldsymbol{\pi}(0)$.
At the end of the first step of the flight phase, we
have thus jumped onto a face of the cube, which means that at
least one component of $\boldsymbol{\pi}(1)$ is equal to 0 or 1, i.e., the
problem is reduced from a problem of sampling from a
population of size $N = 3$ to a population of size $N = 2$.
The flight phase is thus completed in at most $N$ steps.
Figure 4 Flight phase in a population of size N = 3 with a
sample size constraint n = 2


More generally, the flight phase is a random walk in the 
intersection of the balancing subspace and the cube. This 
random walk stops at a vertex of the intersection of the cube 
and the subspace. The flight phase is defined by the following 
class of algorithms. First initialize with $\pi(0) = \pi$. 
Next, at time $t = 0, \ldots, T$, 

1. Generate any vector $u(t) = [u_k(t)] \neq 0$ such that 

(i) $u(t)$ is in the kernel of matrix $A = (x_1/\pi_1, \ldots, x_k/\pi_k, \ldots, x_N/\pi_N)$, i.e., $Au(t) = 0$, 
(ii) $u_k(t) = 0$ if $\pi_k(t)$ is integer. 
2. Compute $\lambda_1^*(t)$ and $\lambda_2^*(t)$, the largest values of $\lambda_1(t)$ and $\lambda_2(t)$ such 
that 
$0 \le \pi(t) + \lambda_1(t)u(t) \le 1$, 
$0 \le \pi(t) - \lambda_2(t)u(t) \le 1$. 
3. Compute 
$$\pi(t+1) = \begin{cases} \pi(t) + \lambda_1^*(t)u(t) & \text{with probability } q_1(t), \\ \pi(t) - \lambda_2^*(t)u(t) & \text{with probability } q_2(t), \end{cases}$$ 
where $q_1(t) = \lambda_2^*(t)/\{\lambda_1^*(t) + \lambda_2^*(t)\}$ and $q_2(t) = 1 - q_1(t)$. 
The flight phase stops when it is no longer possible to find a 
vector $u(t) \neq 0$. 
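
As an illustration, the three steps above can be sketched in Python with NumPy. This is only a minimal sketch under stated assumptions, not the implementation used in the packages cited in this paper: the function name `flight_phase`, the tolerance `tol`, and the choice of drawing u(t) at random from the kernel through a singular value decomposition are all made up for this example. The jump probability used is q_1(t) = λ_2*(t)/{λ_1*(t) + λ_2*(t)}, which is the value that gives E[π(t+1)] = π(t).

```python
import numpy as np

def flight_phase(pi, X, rng, tol=1e-9):
    """Illustrative flight phase (hypothetical helper, not a library API).

    pi  : (N,) inclusion probabilities, all strictly between 0 and 1
    X   : (N, p) matrix of balancing variables
    rng : numpy.random.Generator
    """
    pi0 = np.asarray(pi, dtype=float)
    X = np.asarray(X, dtype=float)
    A = (X / pi0[:, None]).T            # p x N matrix with columns x_k / pi_k
    cur = pi0.copy()                    # pi(0) = pi
    while True:
        # Condition (ii): only non-integer components are allowed to move.
        free = np.flatnonzero((cur > tol) & (cur < 1 - tol))
        if free.size == 0:
            break
        # Step 1: random direction in the kernel of A restricted to free units.
        _, s, vt = np.linalg.svd(A[:, free])
        rank = int(np.sum(s > tol * max(s[0], 1.0)))
        null_basis = vt[rank:]          # rows span the kernel
        if null_basis.shape[0] == 0:
            break                       # no u(t) != 0 exists: stop
        u = rng.standard_normal(null_basis.shape[0]) @ null_basis
        u /= np.linalg.norm(u)
        # Step 2: largest steps that keep every component inside [0, 1].
        p = cur[free]
        with np.errstate(divide="ignore", invalid="ignore"):
            lam1 = np.min(np.where(u > 0, (1 - p) / u,
                                   np.where(u < 0, -p / u, np.inf)))
            lam2 = np.min(np.where(u > 0, p / u,
                                   np.where(u < 0, (p - 1) / u, np.inf)))
        # Step 3: random jump onto a face of the cube, with E[pi(t+1)] = pi(t).
        q1 = lam2 / (lam1 + lam2)
        p = p + lam1 * u if rng.random() < q1 else p - lam2 * u
        p[p < tol] = 0.0                # snap components that reached a face
        p[p > 1 - tol] = 1.0
        cur[free] = p
    return cur

# Setting of Figure 4: N = 3, fixed sample size n = 2 (x_k = pi_k).
pi = np.full(3, 2 / 3)
sample = flight_phase(pi, pi[:, None], np.random.default_rng(1))
```

With the single balancing variable x_k = π_k, the returned vector contains only 0s and 1s and its components sum to 2. If the inclusion probabilities sum to 2.5 instead, exactly one component stays non-integer, which is the rounding problem left to the landing phase.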


5.3 Landing phase 


If, at the end of the flight phase, the balancing equations 
are not exactly satisfied, there is a need for a landing phase. 
Let $\pi^* = [\pi_k^*]$ be the vector obtained at the last step of the 
flight phase. It is possible to prove (see Deville and Tillé 
2004) that 

$$\mathrm{card}(U^*) \le p,$$ 
where 
$$U^* = \{k \in U \mid 0 < \pi_k^* < 1\}$$ 
and $p$ is the number of balancing variables. The aim of 
the landing phase is to find a sample $s$ such that 
$E(s \mid \pi^*) = \pi^*$, which is almost balanced. There are two 
ways of selecting such a sample: 

1. The landing phase by linear programming consists of 
considering all the possible samples of $U^*$. A cost is 
assigned to each sample. This cost is, for instance, 
the distance between the sample and the subspace of 
constraints. Next, one looks for a sampling design on 
$U^*$ that minimizes the expected cost and that 
satisfies the inclusion probabilities $\pi^*$. This problem 
can be solved because the number of samples to 
consider is reasonable due to the small size of $U^*$. 

2. The landing phase by suppression of variables may be 
used when the number of balancing variables is too 
large for the linear program to be solved by a simplex 
algorithm ($p > 20$). With this method, an auxiliary 
variable is dropped at the end of the flight phase. 
Next, we can return to the flight phase until it is no 
longer possible to ‘move’ within the constraint subspace. 
The constraints are then relaxed successively 
according to an order of preference. 


6. Variance and variance estimation 
6.1 A residual technique 


The variance of the Horvitz-Thompson estimator can be 
estimated by using a residual technique developed in 
Deville and Tillé (2005). The residual technique is compa- 
rable to the technique used to estimate the variance of the 
calibration estimator and has been validated by a set of sim- 
ulations. The estimated variance of the Horvitz-Thompson 
estimator is thus very similar to the estimated variance of a 
generalized regression (GREG) estimator. Nevertheless, the 
variance of the GREG estimator is generally underestimated 
because it does not take into account the randomness of the 
weights. Indeed, if the usual variance of the GREG esti- 
mator is computed for the special case of poststratification, 
we obtain the variance of a stratified design with propor- 
tional allocation. The variance of the poststratified estimator 
is nevertheless larger than the variance in a stratified design 
with proportional allocation. 


6.2 Approximation of variance 


If the balanced sampling design has a large entropy, 
Hajek (1981) and Deville and Tillé (2005, method 4) have 
proposed the following approximation of the design 
variance given by: 

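$$\mathrm{var}_p(\hat{Y}_\pi) \approx \mathrm{var}_{\mathrm{app}}(\hat{Y}_\pi) = \sum_{k \in U} \frac{d_k}{\pi_k^2} (y_k - y_k^*)^2, \qquad (5)$$ 

where the subscript $p$ denotes the sampling design, 

$$y_k^* = x_k^\top \left( \sum_{\ell \in U} \frac{d_\ell x_\ell x_\ell^\top}{\pi_\ell^2} \right)^{-1} \sum_{\ell \in U} \frac{d_\ell x_\ell y_\ell}{\pi_\ell^2},$$ 

and the $d_k$ are the solution of the nonlinear system 

$$\pi_k(1 - \pi_k) = d_k - \frac{d_k x_k^\top}{\pi_k} \left( \sum_{\ell \in U} \frac{d_\ell x_\ell x_\ell^\top}{\pi_\ell^2} \right)^{-1} \frac{d_k x_k}{\pi_k}, \quad k \in U. \qquad (6)$$ 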

The entropy of the sampling design depends on the way 
vectors $u(t)$ are chosen during the flight phase. In order to 
increase the entropy, vector u(t) can be chosen randomly 
or the population can be randomly sorted before selecting 
the sample. 

Expression (5), which only uses the first-order inclusion 
probabilities, was validated by Deville and Tillé (2005) 
under a variety of balanced samples regardless of how the y- 
values were generated. An approximation very close to 
Expression (5) was obtained by Fuller (2009) and Legg and 
Yu (2010) for a balanced sampling design obtained by a 
rejective procedure in the case of an initial design that uses 
Poisson sampling. These approximations do not take the 
rounding problem into account. 


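6.3 Estimation of variance 


Deville and Tillé (2005) proposed a family of variance 
estimators for balanced sampling, of the form 

$$\widehat{\mathrm{var}}(\hat{Y}_\pi) = \sum_{k \in S} \frac{c_k}{\pi_k^2} (y_k - \hat{y}_k^*)^2, \qquad (7)$$ 

where 

$$\hat{y}_k^* = x_k^\top \left( \sum_{\ell \in S} \frac{c_\ell x_\ell x_\ell^\top}{\pi_\ell^2} \right)^{-1} \sum_{\ell \in S} \frac{c_\ell x_\ell y_\ell}{\pi_\ell^2},$$ 

and the $c_k$ are the solutions of the nonlinear system 

$$1 - \pi_k = c_k - \frac{c_k x_k^\top}{\pi_k} \left( \sum_{\ell \in S} \frac{c_\ell x_\ell x_\ell^\top}{\pi_\ell^2} \right)^{-1} \frac{c_k x_k}{\pi_k}, \quad k \in S, \qquad (8)$$ 
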

which can be solved by a fixed point algorithm. 

In Deville and Tillé (2005), simpler variants of $c_k$ were 
also proposed. For instance, one can use the alternative 
values, 

$$(1 - \pi_k) \frac{n}{n - p},$$ 

that are very close to $c_k$. The estimator $\widehat{\mathrm{var}}(\hat{Y}_\pi)$ is 
approximately design-unbiased because it is an estimator by 
substitution of the approximation given in Expression (5) 
(for more information regarding estimators obtained by 
substitution, see Deville 1999), which is a reasonable 
approximation of the variance under the sampling design. 
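
To make the computation concrete, here is a hedged sketch of a variance estimator of the residual form described above, using the simple weights (1 − π_k) n/(n − p) in place of the solution of the nonlinear system. The function name and argument layout are assumptions made for this example. The estimator is computed as a weighted sum of squared residuals from the weighted regression of y_k/π_k on x_k/π_k, which is algebraically the same as summing c_k (y_k − ŷ_k*)²/π_k² over the sample.

```python
import numpy as np

def balanced_variance_estimate(y, X, pi):
    """Residual-type variance estimate for the Horvitz-Thompson total.

    Hypothetical helper for illustration only.
    y  : (n,) values of the interest variable for the sampled units
    X  : (n, p) balancing variables for the sampled units
    pi : (n,) inclusion probabilities of the sampled units
    Requires n > p.
    """
    y = np.asarray(y, dtype=float)
    X = np.asarray(X, dtype=float)
    pi = np.asarray(pi, dtype=float)
    n, p = X.shape
    c = (1.0 - pi) * n / (n - p)          # simple weights, assumed here
    Z = X / pi[:, None]                   # rows x_k / pi_k
    yw = y / pi                           # expanded values y_k / pi_k
    # Weighted least squares of y_k/pi_k on x_k/pi_k with weights c_k.
    M = (Z * c[:, None]).T @ Z
    beta = np.linalg.solve(M, (Z * c[:, None]).T @ yw)
    resid = yw - Z @ beta                 # equals (y_k - yhat*_k) / pi_k
    return float(np.sum(c * resid ** 2))

# Equal probabilities n/N = 4/8 with the single balancing variable x_k = pi_k.
y = np.array([1.0, 2.0, 3.0, 4.0])
pi = np.full(4, 0.5)
v = balanced_variance_estimate(y, pi[:, None], pi)
```

In this equal-probability case with only the fixed-size constraint, the sketch reduces to the familiar simple random sampling estimator N²(1 − n/N) s_y²/n. When y is an exact linear combination of the balancing variables, the residuals vanish and the estimate is zero, in line with the discussion of Section 7.1.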

It is not easy to use the bootstrap method to estimate the 
variance in the context of balanced sampling. Balanced 
samples with replacement should be selected from the 
original sample. A generalization of the cube method for 
balanced sampling with replacement has not yet been 
described. A solution, proposed by Chauvet (2007), consists 
of reconstructing an artificial population from the sample. 
Next, bootstrap samples are selected by using balanced 
sampling. Another solution was proposed by Fuller (2010) 
for balanced rejective sampling. Breidt and Chauvet (2010a) 
have proposed an alternative method where a martingale 
difference representation of the cube method is used in order 
to approximate second-order inclusion probabilities, which 
enables us to construct a nearly unbiased variance estimator. 


7. Balanced sampling in practice 


7.1 Interest of balanced sampling 


In the model-assisted and the model-based frameworks, a 
balanced sampling design with the Horvitz-Thompson 
estimator is often the optimal strategy (see Nedyalkova and 
Tillé 2009). Indeed, when the sample is balanced, the vari- 
ances of the Horvitz-Thompson estimators of the auxiliary 
variables are equal to zero. Under a linear model, the vari- 
ance of the Horvitz-Thompson estimator of the interest 
variable will only depend on the residuals of the model. 

The advantages of balanced sampling are as follows: 

(i) Balanced sampling increases the accuracy of the 
Horvitz-Thompson estimator. This point has been 
developed in Section 6. Indeed, the variance of the 
Horvitz-Thompson estimator only depends on the 
residuals of the regression of the interest variable by 
the balancing variables. 

(ii) Balanced sampling protects against large sampling 
errors. Indeed, the most unfavourable samples have 
a null probability of being selected. 

(iii) If the variable of interest is well explained by the 
auxiliary information, in model-based inference, 
balanced sampling protects against a misspecification 
of the model. This point is largely developed 
by Royall (1976b, a) and Valliant et al. 
(2000). A recent discussion of this important question 
is given in Nedyalkova and Tillé (2009, 2010). 

(iv) Balanced sampling can ensure that the sample sizes 
in planned domains are not too small or - much 
worse - equal to zero. Indeed, if an indicator vari- 
able of the domain is added in the list of auxiliary 
variables, the size of the domain is then fixed in the 
sample. 

(v) Balanced sampling allows us to avoid random 
weights. With balanced sampling, the Horvitz- 
Thompson weights can be used. If the sampling 
design does not contain any balancing constraints 
(for instance with Poisson sampling) the weighting 
system obtained by a calibration procedure be- 
comes very random, which increases the variance 
of the estimators. If the sample is balanced, the 
weights will be less random even if a calibration 
procedure is used after balancing. 

The availability of easy-to-use packages contributed to 
the wide use of the cube method in several important 
statistical processes. The first main application of the cube 
method is the selection of the rotation groups for the French 
census (see Desplanques 2000; Dumais, Bertrand and 
Kauffmann 2000; Durr and Dumais 2001, 2002; Dumais 
and Isnard 2000; Bertrand, Christian, Chauvet and 
Grosbras 2004; da Silva, da Silva Borges, Aires Leme 
and Moura Reis Miceli 2006). For the municipalities with 
fewer than 10,000 inhabitants, five non-overlapping rotation 
groups of municipalities are selected using a balanced 
sampling design with equal inclusion probabilities (1/5). 
Each year, a fifth of the municipalities are surveyed. So after 
5 years, all the small municipalities are selected. For the 
municipalities with more than 10,000 inhabitants, in each 
municipality, five non-overlapping balanced samples of 
addresses are selected with inclusion probabilities 8%. So, 
after 5 years, 40% of the addresses are visited. The bal- 
ancing variables are socio-demographic variables taken 
from the last census. 

In the French master sample, the primary units are 
geographical areas that are selected using a balanced sam- 
pling design (see Wilms 2000; Christine and Wilms 2003; 
Christine 2006). The master sample is a self-weighted 
multi-stage sampling. So the primary units are selected with 
unequal probabilities that are proportional to their sizes. The 
balancing variables are socio-demographic variables taken 
from the last census. Bardaji (2001) and Even (2002) have 
also used balanced sampling to select a sample of beneficiaries 
of subsidized jobs. Seven populations are surveyed; a 
balanced sample of beneficiaries is selected in each of the 
populations by using between two and five balancing 
variables, depending on the population. 

In the company Electricité de France (EDF), new 
electricity meters allow electricity consumption for each 
household to be measured on a continuous basis. The 
amount of information collected is so large that it is 
impossible to archive all the data. Dessertaine (2006, 2007) 
used balanced sampling to select the time series of 
consumption that must be archived in order to ensure that 
they represent the consumption of the entire French popu- 
lation as accurately as possible. Biggeri and Falorsi (2006) 
used balanced sampling to improve the quality of the 
consumer price index in Italy. Gismondi (2007) tested 
balanced sampling to estimate the number of tourist nights 
spent in Italy. D’Alo, Di Consiglio, Falorsi and Solari 
(2006) and Falorsi and Righi (2008) also proposed using a 
balanced sampling design to estimate totals in small 
domains. Simulations were run by Mari, Barbara, Mitas and 
Passamonti (2007b, a) in Argentina and Chipperfield (2009) 
in Australia to assess the interest of balanced sampling for 
the master sample. 

At Statistics Canada, Fecteau and Jocelyn (2006) and 
Jocelyn (2006) tested balanced sampling to select a sample 
of businesses. Canadian unincorporated businesses com- 
plete their income tax returns either on paper or elec- 
tronically. More than half of the returns are submitted elec- 
tronically. Balanced sampling was used to select a sample 
from businesses that responded electronically so that, for 
some key variables that are known for the whole population, 
the sample means matched the known population means. 

Balanced sampling can also be used to impute a missing 
value in case of item nonresponse. Indeed, using a model to 
predict an imputation allocates central values, which will 
lead to a biased inference on quantiles. In contrast, a random 
imputation generally increases the variances of the esti- 
mators. In order to solve this dilemma, Deville (1998, 2005, 
2006) and Chauvet, Deville and Haziza (2010c, b) have 
proposed using imputation by prediction and to add a 
residual that is chosen amongst the residuals of the re- 
spondent according to a balanced sampling design. This is 
done to avoid adding a term of variance to the total of the 
imputed variable. 


7.2 Balanced sampling versus other sampling 
techniques 


Unequal probability sampling is a particular case of the 
cube method. Indeed, when the only auxiliary variable is the 
inclusion probability, the sample has a fixed sample size. 
The cube method is a generalization of the splitting method 
(see Deville and Tillé 1998), which includes several sam- 
pling algorithms with unequal probabilities (Brewer’s 
method, pivotal method, corrected Sunter method, see 
Brewer 1975; Sunter 1977; Deville and Tillé 1998; Tillé 
2006b). Stratification is also a particular case of balanced 
sampling. With the cube method, one can balance on 
overlapping strata and use qualitative and quantitative 
variables together. Systematic sampling can even be seen as 
a balanced sampling design on the order statistic related to 
the variable on which the population is ordered. 

Almost all the other sampling techniques are particular 
cases of balanced sampling (except multistage sampling). In 
fact, balanced sampling is simply more general, in the sense 
that all the other methods of sampling can be implemented 
with the cube method. The cube method allows us to use any 
variable for balancing with a reasonable computation time. 
With the more general concept of balancing, strata can 
overlap, quantitative and qualitative variables can be used 
together, and the inclusion probabilities can be chosen freely. 

It is well known that the ratio estimator and the post- 
stratified estimator are particular cases of the regression 
estimator. The regression estimator is also a particular case 
of the calibration estimator (which includes a non-linear 
adjustment). In the same way, balanced sampling is a more 
general method of sampling that includes almost all the 
other methods. The algorithm of the cube method may seem 
complicated but, once implemented, it enables us to run a 
function with two arguments: the vector of inclusion proba- 
bilities and the matrix of balancing variables. 


7.3 Choice of the balancing strategy 


The main recommendation is to choose balancing vari- 
ables that are closely correlated to the interest variables. As 
with any regression problem, the balancing variables must 
be chosen parsimoniously: one must not choose too many 
balancing variables because accuracy no longer improves 
with a large number of variables and the instability of the 
variance estimator increases with each additional variable. 
Practically, the aim is not to estimate one variable but a set 
of interest variables. Thus, the set of auxiliary variables 
must be correlated to all the interest variables. Moreover, 
the auxiliary variables should not be too correlated amongst 
themselves. 

Lesage (2008) has proposed a method to balance a 
sample on complex statistics rather than simply using popu- 
lation totals. The main idea consists in balancing on the 
linearized value (or influence function) of the parameter of 
interest. Breidt and Chauvet (2010b) have proposed using 
penalized balanced sampling in order to possibly relax some 
balancing constraints, which can be useful for instance in 
small domain estimation. 

In many cases, the balancing variables contain measure- 
ment errors. For example, in most registers, one can suspect 
errors in the data. Missing values can obviously occur and 
auxiliary variables are often corrected by a method of 
imputation. As for calibration, the fact of having errors in 
the auxiliary variables is not very important as long as the 
calibration is done on the total of the auxiliary variables of 
the register. Indeed, with balanced sampling, the Horvitz- 
Thompson estimator is used and is unbiased even if the 
auxiliary variables are false. The gain in efficiency only 
depends on the correlation between the balancing variables 
and the interest variables. This correlation is rarely affected 
by errors in the balancing variables. 

Several variables can be used to improve small domain 
estimates. To ensure that a domain D is not empty, one can 
simply add the auxiliary variable: 

$$x_k = \begin{cases} \pi_k & \text{if } k \in D, \\ 0 & \text{otherwise,} \end{cases}$$ 

which implies that the number of sampled units that belong 
to $D$ is equal to 

$$n_D = \sum_{k \in U} x_k = \sum_{k \in D} \pi_k$$ 

if $n_D$ is an integer, or one of the two integers closest to $n_D$ if 
$n_D$ is not an integer. 
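
A small sketch may help to fix ideas. It builds the domain balancing variable described above (π_k inside the domain, 0 outside) and reports the expected domain sample size together with the sizes that a balanced sample can then take. The function names and the tolerance are hypothetical and chosen only for this illustration.

```python
import math

def domain_balancing_variable(pi, in_domain):
    """Return x_k = pi_k if unit k is in the domain D, and 0 otherwise."""
    return [p if d else 0.0 for p, d in zip(pi, in_domain)]

def possible_domain_sizes(pi, in_domain, tol=1e-9):
    """Expected domain size n_D and the sample sizes allowed by balancing.

    If n_D is an integer the domain size is fixed at n_D; otherwise it is
    one of the two integers closest to n_D.
    """
    n_d = sum(domain_balancing_variable(pi, in_domain))
    nearest = round(n_d)
    if abs(n_d - nearest) < tol:
        return n_d, (nearest,)
    return n_d, (math.floor(n_d), math.ceil(n_d))

# Four units in the domain with pi_k = 0.5, two units outside with pi_k = 0.25.
pi = [0.5, 0.5, 0.5, 0.5, 0.25, 0.25]
in_domain = [True, True, True, True, False, False]
n_d, sizes = possible_domain_sizes(pi, in_domain)
```

Here the inclusion probabilities in the domain sum to 2, so a sample balanced on this variable always contains exactly two domain units. With three domain units at π_k = 0.5, the sum is 1.5 and the domain sample size is either 1 or 2.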

In some cases, it is interesting to balance on auxiliary 
variables in subgroups, domains or strata. An interesting 
procedure described in Chauvet (2009) consists of sepa- 
rately running the flight phase in each stratum. A rounding 
problem will then occur in each stratum. These rounding 
problems can then be merged and a flight phase can be run 
again on the whole population. Finally, the landing phase is 
applied only to the whole population. This procedure 
enables us to roughly satisfy the balancing equations in each 
stratum without accumulating the rounding problems. 

The inclusion probabilities must be computed prior to 
sampling. When a linear model is assumed, these proba- 
bilities should in principle be proportional to the errors of 
the model in order to minimize variance (see Tillé and Favre 
2005; Chauvet, Bonnery and Deville 2010a; Nedyalkova 
and Tillé 2009, 2010). This choice generalizes Neyman’s 
allocation for stratified sampling (Neyman 1934). However, 
the inclusion probabilities often need to be chosen under other 
constraints. For instance, in order to construct the rotation 
groups of the French census, the inclusion probabilities must 
all be equal to a fifth. 


7.4 Balancing versus calibration 


Stratification is a particular case of balancing, while 
poststratification is a particular case of calibration. In 
stratification and balancing, the weights do not become random. 
Balancing is thus generally a better strategy than calibration. 
Nevertheless, more auxiliary information is needed for balancing. 
Indeed, for balanced sampling, the auxiliary variables must be 
known for all the units of the population, whereas, for 
calibration, only the population totals are needed. Balancing 
is a very interesting method for small population sizes. It is 
thus a very good method for selecting primary units in a 
multi-stage sampling design. 

Both techniques can be used together. They are not 
contradictory. The best strategy consists of using balanced 
sampling and calibration together. Indeed calibration can 
resolve the small rounding problem that may remain after 
balancing. At the estimation stage, more auxiliary variables 
are often available because, in order to balance a sample, the 
auxiliary information must be known at the individual level 
while, in order to calibrate the sample, only the population 
totals are necessary. 

Generally, it is recommended to re-calibrate on the 
balancing variables at the estimation stage even if more 
calibration variables are available. If only new variables are 
used in calibration, the effect of balancing can be lost. There 
is, however, one case where calibration can be used without 
re-calibrating on the balancing variables: when, condi- 
tionally on the calibration variables, we can reasonably as- 
sume that the balancing variables are no longer correlated to 
the variables of interest. This can occur when the balancing 
and the calibration variables are the same variables 
measured at different moments, and the calibration variables 
are more recent. 

When the determination coefficient between the interest 
variable and the auxiliary variables is equal to or close to 
one, then calibration is more efficient because of the 
rounding problem of balanced sampling. In any case, the most 
efficient strategy always consists of using balanced sampling 
and calibration together (see the simulation in Deville 
and Tillé 2004). 


7.5 Accuracy of the balancing equations 


It is possible to prove, under realistic assumptions (see 
Deville and Tillé 2004), that with the cube method 

$$\left| \frac{\hat{X}_j - X_j}{X_j} \right| < O(p/n),$$ 

where $p$ is the number of variables, and $O(x)/x$ is a quantity 
that remains bounded when $x$ tends to infinity. With 
simple random sampling 

$$\left| \frac{\hat{X}_j - X_j}{X_j} \right| = O_p(1/\sqrt{n}),$$ 

where $O_p(x)/x$ is a quantity that remains bounded in 
probability when $x$ tends to infinity. 

The gains in accuracy are therefore considerable. The 
small rounding problem can be fixed by a small calibration. 
The rounding problem comes from the fact that selecting a 
sample is an integer problem. It also occurs in stratification, 
which is a particular case of balancing. In stratification with 
proportional allocation, the sums of the inclusion proba- 
bilities in the strata are generally not integers. So, the sample 
sizes in the strata are obtained by rounding the sum of 
inclusion probabilities in the strata. The cube method does 
this rounding automatically and randomly in such a way as 
to ensure that the inclusion probabilities are exactly satisfied. 


7.6 Balanced sampling in repeated surveys 


An important difficulty occurs in repeated sampling. The 
problem comes from the fact that, when a balanced sample is 
selected with unequal inclusion probabilities, the complementary 
sample is not necessarily balanced. Indeed, the equality 

$$\sum_{k \in S} \frac{x_k}{\pi_k} = \sum_{k \in U} x_k$$ 

does not imply that 

$$\sum_{k \in U \setminus S} \frac{x_k}{1 - \pi_k} = \sum_{k \in U} x_k.$$ 



This problem occurred in the French master sample. In this 
sampling design, the primary units, which are geographical 
areas, are selected with unequal probabilities that are 
proportional to the size. After selecting the sample, some 
regions asked for complementary samples of areas that were 
not already selected. This question is intricate, because the 
complementary sample of a balanced sample is no longer 
balanced, and the aim is thus to select a balanced sample 
from a part of the population that is no longer balanced. 
Tillé and Favre (2004) gave a few methods to co-ordinate 
balanced samples, which were selected with unequal inclu- 
sion probabilities. More generally, the coordination (in the 
sense of managing overlap) of balanced samples can be 
difficult when the sampling design is balanced. 

While challenging, it is possible to organize rotations if 
all the samples are selected together and the samples are 
selected with equal inclusion probabilities. Indeed, in this 
case the complement $\bar{S} = U \setminus S$ of the sample $S$ is 
also a balanced sample. A second balanced sample can be 
directly selected from $\bar{S}$ and so on. This method was used 
to create five rotation groups in the French master sample. 
The five groups are five balanced samples of municipalities. 
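
The claim that the complement of an equal-probability balanced sample is itself balanced can be checked in one line. The short derivation below is added as an illustration; it writes $\pi$ for the common inclusion probability and $X = \sum_{k \in U} x_k$ for the vector of population totals, and it assumes that the balancing equations hold exactly.

```latex
\sum_{k \in S} \frac{x_k}{\pi} = X
\;\Longrightarrow\;
\sum_{k \in U \setminus S} \frac{x_k}{1 - \pi}
= \frac{1}{1 - \pi} \Bigl( X - \sum_{k \in S} x_k \Bigr)
= \frac{X - \pi X}{1 - \pi}
= X .
```

The step from the first to the second expression uses the fact that $\pi$ does not depend on $k$, which is exactly what fails under unequal inclusion probabilities.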

If the samples are selected with unequal inclusion 
probabilities, some solutions are described in Tillé and Favre 
(2004). An interesting particular case can easily be solved: 
when two non-overlapping samples must be selected with 
the same unequal inclusion probabilities $\pi_k < 0.5$ from the 
same population. First, a sample $S_A$, balanced on $x_k$, must 
be selected with inclusion probabilities $\pi_{Ak} = 2\pi_k$ such that 

$$\sum_{k \in S_A} \frac{x_k}{2\pi_k} = \sum_{k \in U} x_k.$$ 

Next, a sample $S_1$ can be selected from $S_A$. This sample 
must be selected with inclusion probability $\pi_{1k} = 0.5$ and 
must be balanced on $x_k/(2\pi_k)$, which gives the following 
balancing equations: 

$$\sum_{k \in S_1} \frac{x_k/(2\pi_k)}{1/2} = \sum_{k \in S_A} \frac{x_k}{2\pi_k} = \sum_{k \in U} x_k.$$ 

The sample $S_2 = S_A \setminus S_1$ is also balanced. 
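
The reason why the second sample is balanced can be spelled out as follows. This short derivation is added for illustration. It denotes by $S_A$ the first-stage sample selected with probabilities $2\pi_k$, by $S_1$ the half-sample selected from it, by $S_2 = S_A \setminus S_1$ the remainder, and by $X = \sum_{k \in U} x_k$ the population totals, and it assumes that both sets of balancing equations hold exactly.

```latex
\sum_{k \in S_A} \frac{x_k}{2\pi_k} = X
\;\Longrightarrow\;
\sum_{k \in S_A} \frac{x_k}{\pi_k} = 2X,
\qquad
\sum_{k \in S_1} \frac{x_k/(2\pi_k)}{1/2} = \sum_{k \in S_1} \frac{x_k}{\pi_k} = X,

\sum_{k \in S_2} \frac{x_k}{\pi_k}
= \sum_{k \in S_A} \frac{x_k}{\pi_k} - \sum_{k \in S_1} \frac{x_k}{\pi_k}
= 2X - X = X .
```

Each unit is in $S_1$ with probability $2\pi_k \times 0.5 = \pi_k$, and likewise for $S_2$, so both samples have the required inclusion probabilities.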

If the population changes over time (deaths and births), 
the organization of a rotation becomes much more difficult. 
This difficulty already occurs with stratified samples. Never- 
theless, for stratification, several reasonable solutions exist 
(see, amongst others, De Ree 1999; Hesse 1998; Riviere 
1999; Nedyalkova, Péa and Tillé 2006). 


7.7 Main implementations of balanced sampling 


A SAS/IML® implementation was first programmed by 
three students of the Ecole nationale de la statistique et de 
l’analyse de l’information (Ensai) (Bousabaa et al. 1999). 
An official version of the Institut National de la Statistique 
et des Etudes Economiques (Insee), written by Tardieu (2001) and 
Rousseau and Tardieu (2004), is now available on the Insee 
Web site. Another SAS/IML® version, written by Chauvet and 
Tillé (2005b, a, 2006), is also available on the University of 
Neuchâtel Web site. In R language, the sampling package 
(Tillé and Matei 2007) allows us to use the cube method. 
These software programs are free, available over the 
Internet and are easy to use. 

The available programs written using R language or 
SAS/IML® have no limit as far as population size is con- 
cerned. An application with 40 balanced variables is possible. 
In order to select the sample, the computation time increases 
with $N \times p^2$, where $N$ is the population size and $p$ the 
number of balancing variables. It is thus possible to select a 
sample in a population of several million statistical units. 


Innovations in survey sampling design: 
Discussion of three contributions presented at the U.S. Census Bureau 


Jean Opsomer 1 


1. Introduction 


The U.S. Census Bureau is one of the largest survey data 
collection organizations in the world, in addition to its role 
in the collection of the U.S. Decennial Census data. The two 
major statistical tools used by the Census Bureau in de- 
signing its surveys are stratification and multi-stage sam- 
pling. These tools have been successfully implemented 
starting in the 1940s and have continually been adapted and 
refined since then. 

While this general sampling approach has been very 
successful, there are increasing concerns about rising survey 
costs, decreasing response rates and new frame coverage 
issues (especially related to telephones). At the same time, 
advances in data collection methods, new data sources and 
computational tools offer opportunities for considering 
survey design approaches that would have been unfeasible 
before. In conjunction with the 2010 Redesign Program 
currently on-going at the Census Bureau, input was there- 
fore sought from leading academic researchers in innovative 
sampling methods, as a way to initiate the exploration of 
possible new approaches to design surveys conducted by the 
Census Bureau. As a result, Profs. Steve Thompson (Simon 
Fraser University), Sharon Lohr (Arizona State University) 
and Yves Tillé (Université de Neuchâtel) were invited to 
give overview lectures on some of the designs they de- 
veloped. I was invited to contribute a discussion to each of 
these lectures. 

In the three sections that follow, I will summarize my 
comments to each of these lectures. My goals in those 
comments were to highlight the most important aspects of 
the sampling methods that were presented, to discuss some 
of the main opportunities for using these designs in the 
household sampling context, and to identify possible 
challenges in implementation. 


2. Adaptive network and spatial sampling 


Prof. Thompson’s lecture covered a broad class of de- 
signs that includes adaptive cluster sampling, network 
sampling and adaptive web sampling. Unless I am referring 
to a specific design within this class, I will refer to these 
designs as “adaptive sampling” in what follows. A major 
advantage of adaptive sampling is that it incorporates some 
of the features of “convenience” sampling approaches such 
as snowball sampling, including decreased reliance on a 
sampling frame and the ability to target sampling to portions 
of the population of particular interest. But unlike conve- 
nience sampling, adaptive sampling remains firmly design- 
based, in the sense of allowing randomization-based finite 
population estimation and inference. 

In adaptive sampling procedures, an initial sample $s_0$ is 
drawn according to a probability sampling design $p_0(s_0)$. 
Based on the characteristics of the elements in $s_0$ (e.g., 
presence/absence of features of interest or an enumeration of 
“links” to other elements in the population), a follow-up 
sample $s_1$ is selected from the remaining population, using 
a conditional sampling design $p_1(s_1 \mid s_0)$. This process is 
repeated with successive incremental samples $s_2, s_3, \ldots$ 
until a target criterion such as overall sample size or number 
of sampling “waves” is reached, and the final sample is the 
union of each of the successive samples. The specifics of 
how the waves are drawn vary by adaptive design. Section 
2.2 of Thompson’s article in this issue and Thompson 
(2006) provide additional details for adaptive web sampling, 
a very flexible type of adaptive sampling that includes many 
of the other designs as special cases. 

Because the designs for each of the sampling waves are 
probability designs, it is possible to obtain valid design-based 
estimators. A simple estimator for the finite population 
mean $\mu_y = N^{-1} \sum_{i \in U} y_i$ is constructed as follows. 
Based on the initial design $p_0$ with associated inclusion 
probabilities $\pi_{0i}$, an unbiased estimator for the population 
mean is given by $\hat{\mu}_0 = N^{-1} \sum_{i \in s_0} y_i / \pi_{0i}$. For each of the 
subsequent waves $k = 1, \ldots, K$, an unbiased estimator $z_k$ of $\mu_y$ 
is constructed using the conditional inclusion probabilities 
$q_{ki}$ for wave $k$ (see Thompson 
(2006) for details on the construction of $z_k$ and the $q_{ki}$, and Section 2.4 
of Thompson’s article in this issue for specific examples). 
Letting $\hat{\mu}_1 = K^{-1} \sum_{k=1}^{K} z_k$, an unbiased estimator for $\mu_y$ is 
obtained as $\hat{\mu} = w \hat{\mu}_0 + (1 - w) \hat{\mu}_1$, which is a linear 
combination of the initial estimator and the mean of the 
subsequent estimators. 

The estimator $\hat{\mu}$ is design unbiased but it depends on the 
order of the waves in which the sample was obtained. A 
more precise estimator can be obtained by averaging over 
all the different orders in which the same sample could have 
been obtained. For small sample sizes, an explicit expression 
is available for this more efficient estimator, but in 
general it needs to be approximated by repeated sampling 
from an appropriately defined Markov chain, and taking the 
mean of the samples. The exact methods for setting up the 
chain and drawing the samples are described in Thompson 
(2006), which also discusses variance estimation for the 
resulting estimator. 


1. Jean Opsomer, Department of Statistics, Colorado State University, Fort Collins, CO 80523-1877. E-mail: jopsomer@stat.colostate.edu. 

One of the primary advantages of adaptive sampling 
designs is that they allow the survey organization to focus 
the sample in portions of interest in the population. This is 
particularly useful in situations where some of the elements 
of interest are relatively rare and where they cannot be 
identified a priori in a sampling frame. Examples of such 
situations are surveys of hunting and fishing behavior, 
recent immigrants, home-schoolers, or owners of family- 
owned businesses. In each of these cases, the elements are 
quite “diffuse” in the population and no comprehensive 
frame is generally available. However, it is likely that 
individuals who are part of this population will be able to 
provide information on other individuals, so that links can 
be identified and sampled across different adaptive sampling 
waves. Note that adaptive sampling can also be used when 
these types of rare elements are part of a subpopulation of 
interest within a survey of a larger and non-rare population. 
For instance, a survey of school children might want to 
include a stratum of home-schooled children. 

Finding relatively rare (sub)populations is a common 
challenge in surveys, and a number of methods are regularly 
deployed to deal with this issue. Perhaps the most common 
sampling design in the context of household surveys is 
stratified multi-stage sampling. To the extent that relevant 
PSU-level auxiliary information is available, the survey 
organization can oversample PSUs expected to contain a
larger fraction of the groups of interest. An example of such 
a situation is a survey of African-American males at risk of 
Parkinson’s disease, in which Census tracts with a higher
African-American population fraction could be oversam- 
pled. Another sampling design that can be useful in this 
context is multi-phase sampling. In this case, the first phase 
of sampling is used either as a screening sample or as a way 
to collect relevant auxiliary information, while subsequent 
phases focus on obtaining the survey data of interest. The 
Agricultural Resource Management Survey (ARMS) con- 
ducted by the USDA follows this design. A sample of all 
farms is selected in phase 1, in which farm characteristics 
for the survey year are collected. In later phases, targeted 
samples based on the commodities of interest (e.g., dairy,
wheat, etc.) are selected. A third sampling approach that is
sometimes useful for obtaining samples of rare (sub)popu- 
lations is multi-frame sampling. The principle underlying 
multi-frame sampling is to combine several frames with 
different coverage characteristics, for instance a “good”
frame containing a large proportion of elements of interest 
but potentially incomplete and a “bad” frame that is compre- 
hensive but contains a low proportion of elements of inter- 
est. For instance, a survey of companies in a particular 
industry might be able to use an industry group membership 
list as the “good” frame and a general company list as the 
“bad” frame. For a more in-depth look at multi-frame sam- 
pling, see Section 3 below. 

Compared to these three designs, adaptive sampling is 
more flexible and allows finer control over the number and 
characteristics of elements that are included in the sample, 
which will often result in improved efficiency and/or lower 
cost. A drawback of adaptive sampling is that information 
needs to be collected on the linkages between elements, 
which can increase respondent burden and data collection 
cost, and potentially raise confidentiality issues.

Because adaptive sampling frequently relies on “links” 
between elements in order to define the conditional selec- 
tion probabilities in the sampling waves, it is also parti- 
cularly well-suited for surveys that are interested in studying 
connections between elements in a population. Examples of 
such situations might be surveys involving transactions or 
relationships between businesses, surveys of barter/trading 
behavior of households, and surveys of family network 
relationships or characteristics. 

For a survey organization contemplating adoption of 
adaptive sampling, a number of issues related to estimation 
and data dissemination need to be considered. In many 
cases, the survey data are released in the form of a weighted 
dataset, and variance estimates are provided in the form of a 
simplified design description (e.g., strata and PSUs), repli- 
cate weights or generalized variance functions. It is also 
very common for the weights to be calibrated and/or 
adjusted for non-response. Estimators for adaptive designs 
are indeed expressible as weighted sample sums, so that a 
weighted dataset could readily be created even for the 
Markov chain version of the estimators mentioned above. 
The choice of how to best provide variance estimates with 
the dataset is something that still needs to be investigated 
and might depend on the specifics of the survey. Similarly, 
how to incorporate calibration and nonresponse adjustments 
in adaptive sampling estimation is an area where additional 
research is needed. 


3. Sampling with multiple overlapping frames 


Prof. Lohr gave a comprehensive overview of general 
sampling designs and estimation methods when sampling 
uses multiple frames. Traditional approaches for conducting 
surveys are increasingly called into question today, because 
of increasing costs, decreasing response levels for traditional
modes, and increasing concerns for undercoverage of ex- 
isting sampling frames (e.g., landline telephone numbers 
reached by RDD). By drawing samples from several frames 
instead of from a single frame, it is possible to reduce 
survey costs, improve the coverage of the overall sample, 
and potentially even increase response rates depending on 
the specific survey being conducted (for instance, because 
of improved respondent identifier information in one of the 
frames). 

Multiple frame sampling is a pure randomization-based 
approach to draw samples, and sampling within the indi- 
vidual frames follows the same methodology as “classical” 
single-frame sampling. Fully design-based estimation meth- 
ods for multiple-frame sampling are available, several of 
which can readily be deployed in the large-scale survey 
context in which a weighted dataset is the primary output 
(see below). The key feature of all estimation methods is the 
estimation of the frame overlap, which is typically unknown 
but needs to be accounted for. This is done by constructing,
for each frame, design-based estimators for the
subpopulation(s) of elements that also fall in the other frame(s).
The estimators for the characteristics of the frame inter- 
section(s) then need to be combined across frames. Existing 
methods differ in how they combine these estimators, with 
the simplest methods using sample-size weighted averages 
and more complex estimators weighting by estimates of the 
precision of the individual estimators. 
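
To make the combination step concrete, the following minimal sketch computes a composite dual-frame estimate of a total with a fixed compositing factor. The function name `dual_frame_total`, its argument names and the toy data are illustrative assumptions, and the sample-size weighted choice of `theta` is only the simplest of the options mentioned above.

```python
import numpy as np

def dual_frame_total(y_a, w_a, in_ab_a, y_b, w_b, in_ab_b, theta):
    """Composite dual-frame estimator of a population total (illustrative).

    y_*     : survey variable for the sample drawn from frame A or frame B
    w_*     : design weights (inverse inclusion probabilities) in that frame
    in_ab_* : True if the sampled unit also belongs to the other frame
    theta   : compositing factor in [0, 1] given to frame A's overlap estimate
    """
    y_a, w_a, in_ab_a = map(np.asarray, (y_a, w_a, in_ab_a))
    y_b, w_b, in_ab_b = map(np.asarray, (y_b, w_b, in_ab_b))
    # Down-weight overlap units so the two overlap estimates are averaged
    # instead of being counted twice.
    adj_a = np.where(in_ab_a, theta * w_a, w_a)
    adj_b = np.where(in_ab_b, (1.0 - theta) * w_b, w_b)
    total = float(np.sum(adj_a * y_a) + np.sum(adj_b * y_b))
    return total, adj_a, adj_b

# Toy data (hypothetical): 3 units sampled from frame A, 2 from frame B.
theta = 3 / (3 + 2)  # sample-size weighted compositing factor
total, adj_a, adj_b = dual_frame_total(
    [1.0, 2.0, 3.0], [10.0, 10.0, 10.0], [False, True, True],
    [4.0, 5.0], [20.0, 20.0], [True, False],
    theta,
)
print(total, adj_a, adj_b)
```

Because the adjustment is applied to the weights rather than to a particular survey variable, the same adjusted weights can be attached to a released dataset and reused for every variable.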

Sampling from multiple frames is particularly applicable 
in cases where no single frame is available that covers the 
whole population. Typical examples of such situations are 
RDD sampling, where an increasing fraction of the popu- 
lation is not reachable through a landline telephone number, 
and surveys of professionals or businesses with partial listings
available from vendors or professional organizations. Other 
situations in which multiple frame sampling might be appli- 
cable are surveys of rare subpopulations that exist within a 
larger population. An overall frame for the population 
exists, but screening respondents for whether they belong 
to the subpopulation is time-consuming and expensive. An
alternate frame containing a much higher proportion of 
elements from the subpopulation of interest is sometimes 
available, but if the coverage of that frame is incomplete, the 
survey organization might not be willing to rely on it for 
fear of not obtaining a valid sample. Combining the 
alternate but incomplete subpopulation frame with the 
complete but inefficient population frame might be both 
cost-effective and statistically defensible. Examples of 
surveys of such subpopulations are surveys of hunting and 
fishing, where a license frame often exists but it might be 
incomplete or out of date. This multiple frame approach 
might also be useful for a survey of the general population, 
as a way to increase the sample size within certain subpopulations
of particular interest. For instance, in a general
survey of farms, it might be of interest to produce estimates
for organic farms, which represent only a small fraction of
farms but many of which are listed in organic business
directories. Section 1 of Lohr’s article in this issue gives
several additional examples of the wide applicability of 
multiple frame surveys. 

As noted above, estimation methods involve the con- 
struction of estimators for the frame intersection subpopu- 
lation, which requires selection of a weighting method for 
the estimators obtained from the different frames. Weighting 
methods that rely on estimating the precision of these esti- 
mators might be preferred from an efficiency perspective. 
However, they are somewhat problematic to implement in 
practice, because the resulting weights can vary for different 
variables in the survey. More practical approaches will 
forego some efficiency in order to be able to have single 
weights for all survey variables, a key feature emphasized 
repeatedly in Lohr’s article in this issue. The pseudo- 
maximum likelihood (PML) method of Skinner and Rao 
(1996) produces a single set of weights and is recommended 
by Lohr as the method of choice for single surveys, while a 
simpler fixed-weight approach is preferable for longitudinal 
surveys. 
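
The reason precision-based weights differ across variables can be seen in a deliberately simplified sketch. It assumes the two overlap-domain estimators are independent and unbiased and ignores their covariance with the non-overlap estimators, in which case the variance-minimizing weight on frame A's estimator is `var_b / (var_a + var_b)`. The variable names and variance values below are hypothetical.

```python
def min_variance_theta(var_a, var_b):
    """Weight on frame A's overlap estimator that minimizes the variance of
    theta * est_a + (1 - theta) * est_b for independent unbiased estimators."""
    return var_b / (var_a + var_b)

# Hypothetical variance estimates (frame A, frame B) of the overlap-domain
# estimators for two different survey variables.
overlap_variances = {"income": (4.0, 1.0), "age": (1.0, 3.0)}

thetas = {
    name: min_variance_theta(var_a, var_b)
    for name, (var_a, var_b) in overlap_variances.items()
}
print(thetas)
```

Each variable ends up with its own compositing factor, and hence its own set of weights, which is the practical difficulty that a single fixed factor or the PML approach avoids.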

While the basic methodology for constructing design- 
based estimators for multiple frame sampling is in place 
today, there is still a need for further research in approaches 
for applying calibration and nonresponse adjustment in this 
context. Because it is possible to apply those adjustments at 
the individual frame level, the population level, or both 
levels (depending on the available auxiliary information), an 
investigation of the properties of the estimators under these 
different scenarios would be very useful, and should be used 
to develop guidelines for survey practitioners. Section 3 of 
Lohr’s article in this issue discusses some initial results in 
this area. 

Variance estimation methods for multiple-frame estimators
have been developed and are reviewed in Section 4.2
of Lohr’s article; they include both linearization and replication
approaches. An important practical issue in the use of
the linearization approach is that it requires access to the 
frame identification for all the elements in the sample, 
because it involves separate estimation of the variance in 
each frame. This might be undesirable for the survey organi- 
zation producing the data, for reasons of data confiden- 
tiality. In the case of replication methods such as jackknife 
and bootstrap, it is possible for the survey organization to 
create sets of replicate weights that do not require disclosure 
of the frame identity of individual sample elements to the 
data users. Lohr (2007) recommends the combined boot- 
strap approach for inference for multiple frame sampling. 
As an alternative, the grouped jackknife of Kott (2001)
could also be considered. 
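
The confidentiality argument for replication can be illustrated with a small sketch of a delete-a-group jackknife in its simplest unstratified form. It is not claimed to reproduce the procedures of Kott (2001) or Lohr (2007); the function names, the grouping and the data are assumptions. In a multiple frame survey the producer would form the groups with knowledge of each frame's design, but the data user would see only the final and replicate weights.

```python
import numpy as np

def make_group_jackknife_weights(weights, groups):
    """Producer side: build delete-a-group replicate weights.

    Replicate g sets the weights of group g to zero and inflates the
    weights of all other groups by G / (G - 1).
    """
    weights = np.asarray(weights, dtype=float)
    groups = np.asarray(groups)
    labels = np.unique(groups)
    n_groups = len(labels)
    reps = np.tile(weights, (n_groups, 1)) * n_groups / (n_groups - 1)
    for g, label in enumerate(labels):
        reps[g, groups == label] = 0.0
    return reps

def replicate_variance(y, weights, rep_weights):
    """User side: variance of a weighted total from released weights only."""
    y = np.asarray(y, dtype=float)
    full = np.sum(np.asarray(weights, dtype=float) * y)
    rep_totals = rep_weights @ y
    n_groups = rep_weights.shape[0]
    return float((n_groups - 1) / n_groups * np.sum((rep_totals - full) ** 2))

# Hypothetical released data: four respondents, two jackknife groups.
w = [2.0, 2.0, 2.0, 2.0]
reps = make_group_jackknife_weights(w, groups=[0, 0, 1, 1])
print(replicate_variance([1.0, 2.0, 3.0, 4.0], w, reps))
```

Nothing in `replicate_variance` refers to the frame from which a respondent was sampled, which is the property that makes replicate weights attractive for public-use files.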

Implementing multiple frame sampling surveys can be 
more challenging than single-frame surveys. There needs to
be awareness of the increased potential for nonsampling
errors, as discussed in Section 5 of Lohr’s article, especially
if the data collection modes or protocols vary across frames.
For instance, sampled elements in one frame might get an advance
letter, while those in another frame receive a “cold call”
because of lack of address information. It is also possible
that the nonresponse characteristics differ across frames, so
that separate adjustments are required. Finally, in many
cases the elements present in the different frames might 
have different characteristics (e.g., organic farms belonging 
to a national organic business association vs. those that do 
not). In all those cases, attention to frame-specific effects 
and careful weight construction are required in order to 
obtain valid survey estimators. On the other hand, the 
presence of multiple frames provides opportunities for 
measuring nonsampling errors, because they entail multiple 
samples from the same population. For instance, it might be 
useful to perform “cold calls” for a portion of the selected 
elements in the frame with addresses to evaluate mode 
effects. 


4. Balanced sampling with the cube method 


The presentation by Prof. Tillé covered the fundamentals 
of balanced sampling and described the cube method, which 
he developed as a practical algorithm implementing the 
drawing of balanced samples. The goals of balanced sam- 
pling designs are to maintain the representation of the 
population structure in the sample (hence the term “bal- 
ance”), and to improve the efficiency of survey estimators. 
Today, most survey statisticians apply stratification as the 
primary tool to achieve these two goals. Stratification 
achieves balance by forcing the sample composition to 
match the stratum allocation, and improves the efficiency of 
estimators by removing the component of variance due to 
between-stratum differences. Systematic sampling is also 
used to achieve these goals, most commonly in natural re- 
source surveys. In this case, the sample composition 
matches the population composition exactly along the 
sorting variable, and approximately for any variable corre- 
lated with the sorting variable. Efficiency is gained because 
sample moments of the variables of interest (approximately) 
match population moments. While both approaches are 
widely used and work well, they are relatively inflexible. 
Stratification often involves dividing the population into 
“cells” defined by the intersection of stratification variables, 
which might lead to a proliferation of many small cells with 
corresponding small sample sizes. Systematic sampling is a
highly constrained form of sampling with a limited amount of
flexibility in sample construction, and with the additional 
issue of the lack of a design-based variance estimator. 

Balanced sampling can be viewed as a generalization of 
stratification. Under this interpretation, stratified samples are 
drawn with given probabilities of inclusion for all the popu- 
lation elements, but subject to constraints on the sample size 
in each stratum. In balanced sampling, the stratification 
constraints are replaced by constraints of the form
$\sum_{k \in s} \mathbf{x}_k / \pi_k = \sum_{k \in U} \mathbf{x}_k$, where $\mathbf{x}_k$ is a vector of balancing
variables. When the $\mathbf{x}_k$ are stratum indicators, balanced
sampling is the same as stratification, but any categorical or 
continuous variables (or combination thereof) can be used, 
which provides a high degree of flexibility in sample 
construction. 
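
The link with stratification can be checked numerically. The sketch below evaluates the balancing equations, that is, the difference between the Horvitz-Thompson estimate of the totals of the balancing variables and their known population totals, for a toy population in which the balancing variables are stratum indicators. The function name `balance_gap` and the population are illustrative assumptions.

```python
import numpy as np

def balance_gap(X, pi, sample):
    """Return sum_{k in s} x_k / pi_k minus sum_{k in U} x_k.

    X      : (N, p) array of balancing variables for the whole population U
    pi     : length-N array of inclusion probabilities
    sample : indices of the sampled elements s
    A zero vector means the sample satisfies the balancing equations exactly.
    """
    X = np.asarray(X, dtype=float)
    pi = np.asarray(pi, dtype=float)
    sample = np.asarray(sample, dtype=int)
    ht_totals = (X[sample] / pi[sample][:, None]).sum(axis=0)
    return ht_totals - X.sum(axis=0)

# Hypothetical population: 4 elements in stratum 1 and 2 in stratum 2,
# with stratum indicators as balancing variables and a 50% sampling rate.
X = np.array([[1, 0]] * 4 + [[0, 1]] * 2)
pi = np.full(6, 0.5)

print(balance_gap(X, pi, [0, 1, 4]))  # 2 + 1 elements: matches the allocation
print(balance_gap(X, pi, [0, 1, 2]))  # 3 + 0 elements: violates the allocation
```

With indicator variables the gap is zero exactly when the sample contains the allocated number of elements in every stratum, so the balancing constraints reproduce the fixed stratum sample sizes of a stratified design.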

As noted above, the cube method is an algorithm that 
draws balanced samples given a set of inclusion probabili- 
ties and constraints. If exactly balanced samples exist in the 
population, the algorithm will try to select one of them. If no 
sample can be found that has the postulated inclusion 
probabilities and satisfies the balancing constraints exactly, 
it will attempt to come as close as possible to satisfying the 
constraints. The cube method requires that the balancing 
variables $\mathbf{x}_k$ be known for all elements in the population.
Depending on the survey context, this requirement might 
represent a key limitation on the applicability of balanced 
sampling. 

Despite the fact that balancing on population-level 
auxiliary variables is done at the design stage, it seems 
likely that in practice, calibration and other weight adjust- 
ments such as for nonresponse will still often be required. In 
fact, Tillé recommends the combination of balancing and 
calibration as the most efficient strategy (see Section 7.4 of 
Tillé’s article in this issue). The theoretical properties of 
estimators that are both balanced and calibrated still need to
be fully worked out, however. 

While balanced sampling maintains the inclusion proba- 
bilities of the elements in the population, it is clear that the 
presence of the balancing constraints affects their joint 
inclusion probabilities and hence the variance of the esti- 
mators. This topic is addressed in Section 6 of Tillé’s article. 
Deville and Tillé (2005) showed that, under certain condi- 
tions, the variance of balanced sampling estimators can be 
approximated by a linearization-type variance, which de- 
pends on the residuals of a linear regression of the survey 
variables on the balancing variables. While this is an 
important and useful result, it does not lead to a variance 
estimation approach that is applicable to all survey ap- 
plications. One issue is that variance estimation based on 
this method requires access to the balancing variables for all 
the survey respondents, and these might not be made 
publicly available as part of the survey dataset. In this
context, a replication-based method might be particularly 
attractive, because it would not require releasing these 
variables. However, no such method is currently available.

Balanced sampling has close connections with rejective
sampling, which aims to achieve the same goals. In rejective 
sampling, a sample is drawn with prespecified inclusion 
probabilities and the sample is accepted or rejected based on 
whether it is within a given tolerance level of a balancing 
constraint. If the sample is rejected, the procedure is 
repeated until a sample is found that falls within the 
tolerance level. Rejective sampling has a long history, and
Fuller (2009) developed asymptotic theory showing that
his version of rejective sampling is approximately
equivalent to balanced sampling.
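
The accept/reject loop described above can be sketched in a few lines. For simplicity the sketch assumes equal-probability simple random sampling without replacement and a single balancing variable, so it is not Fuller's (2009) procedure; the function name `rejective_srs`, the tolerance and the data are illustrative assumptions.

```python
import numpy as np

def rejective_srs(x, n, tol, rng, max_tries=10_000):
    """Draw simple random samples of size n until the sample mean of the
    balancing variable x is within tol of its population mean."""
    x = np.asarray(x, dtype=float)
    target = x.mean()
    for _ in range(max_tries):
        sample = rng.choice(len(x), size=n, replace=False)
        if abs(x[sample].mean() - target) <= tol:
            return np.sort(sample)  # accepted sample
    raise RuntimeError("no sample met the tolerance; consider loosening tol")

# Hypothetical population of 20 elements with a known balancing variable.
x = np.arange(20.0)
rng = np.random.default_rng(0)
s = rejective_srs(x, n=4, tol=0.5, rng=rng)
print(s, x[s].mean(), x.mean())
```

Discarding samples changes the inclusion probabilities actually realized by the procedure, which is one reason the asymptotic theory for rejective sampling is needed.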


5. Closing remarks 


The methods covered in the three lectures are remarkably 
complementary. Adaptive designs make it possible to obtain 
randomization-based, statistically valid samples for popu- 
lations that have traditionally been difficult to sample 
efficiently. Very little frame information is required to draw 
such a sample, but a significant amount of effort has to be 
expended during the data collection in order to identify and 
follow the “links” among the elements, and draw the 
successive samples. In contrast, balanced sampling is useful 
when very detailed frame information is available, and in 
that situation, it allows for highly customized and efficient 
sample designs. Once a balanced sample is drawn, the data 
collection can proceed in the same manner as for traditional 
surveys. Multiple frame sampling covers an intermediate
case, in the sense that no single good frame exists but 
several partial frames are used to “offset” each other’s 
weaknesses. Separate samples are drawn from each frame, 
and data collection proceeds as usual, except for the fact
that it is necessary to determine which frame(s) each
sampled respondent belongs to.

Combined with the existing approaches already in use, 
these three new sampling methods have the potential to 
greatly increase the flexibility with which samples can be 
customized for specific applications, to reduce survey costs 
and to increase the precision of survey estimators. 


ANNOUNCEMENTS 


Nominations Sought for the 2013 Waksberg Award 


The journal Survey Methodology has established an annual invited paper series in honour of 
Joseph Waksberg to recognize his contributions to survey methodology. Each year a prominent survey 
statistician is chosen to write a paper that reviews the development and current state of an important topic in 
the field of survey methodology. The paper reflects the mixture of theory and practice that characterized 
Joseph Waksberg’s work. 


The recipient of the Waksberg Award will receive an honorarium from Westat. The paper will be 
published in a future issue of Survey Methodology. 


The author of the 2012 Waksberg paper will be selected by a four-person committee appointed by Survey 
Methodology and the American Statistical Association. Nomination of individuals to be considered as 
authors or suggestions for topics should be sent before February 28, 2012 to the chair of the committee, 
Mary Thompson (methomps@uwaterloo.ca). 


Previous Waksberg Award honorees and their invited papers are: 


2001  Gad Nathan, “Telesurvey methodologies for household surveys - A review and some thoughts
      for the future?”. Survey Methodology, vol. 27, 1, 7-31.

2002  Wayne A. Fuller, “Regression estimation for survey samples”. Survey Methodology, vol. 28, 1,
      5-23.

2003  David Holt, “Methodological issues in the development and use of statistical indicators for
      international comparisons”. Survey Methodology, vol. 29, 1, 5-17.

2004  Norman M. Bradburn, “Understanding the question-answer process”. Survey Methodology, vol.
      30, 1, 5-15.

2005  J.N.K. Rao, “Interplay between sample survey theory and practice: An appraisal”. Survey
      Methodology, vol. 31, 2, 117-138.

2006  Alastair Scott, “Population-based case control studies”. Survey Methodology, vol. 32, 2,
      123-132.

2007  Carl-Erik Särndal, “The calibration approach in survey theory and practice”. Survey
      Methodology, vol. 33, 2, 99-119.

2008  Mary E. Thompson, “International surveys: Motives and methodologies”. Survey Methodology,
      vol. 34, 2, 131-141.

2009  Graham Kalton, “Methods for oversampling rare subpopulations in social surveys”. Survey
      Methodology, vol. 35, 2, 125-141.

2010  Ivan P. Fellegi, “The organisation of statistical methodology and methodological research in
      national statistical offices”. Survey Methodology, vol. 36, 2, 123-130.

2011  Danny Pfeffermann, “Modelling of complex survey data: Why model? Why is it a problem?
      How can we approach it?”. Survey Methodology, vol. 37, 2, 115-136.

2012  Lars Lyberg, Manuscript topic under consideration.


Members of the Waksberg Paper Selection Committee (2011-2012) 


Mary Thompson, University of Waterloo (Chair) 
J.N.K. Rao, Carleton University 
Cynthia Clark, USDA 


Past Chairs: 


Graham Kalton (1999 - 2001) 
Chris Skinner (2001 - 2002) 
David A. Binder (2002 - 2003) 

J. Michael Brick (2003 - 2004) 
David R. Bellhouse (2004 - 2005) 